Today:

Visual example

Does leaf temperature vary with water stress in wild-type tomato?

# simple comparison model
# (assumes a data/ folder exists in the working directory)
download.file("https://timotheenivalis.github.io/data/droughtdata.csv", 
              destfile = "data/droughtdata.csv")

drought_data <- read.csv("data/droughtdata.csv")
str(drought_data)
## 'data.frame':    32 obs. of  4 variables:
##  $ plant         : int  1 1 1 1 2 2 2 2 3 3 ...
##  $ Genotype      : Factor w/ 2 levels "mutant","WT": 2 2 2 2 2 2 2 2 1 1 ...
##  $ WaterCondition: Factor w/ 2 levels "Drought","Normal": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Temperature   : num  30 31 31.6 31.9 32 32.1 32.7 33.5 30 31 ...
# keep only the wild-type plants
wt_data <- drought_data[drought_data$Genotype=="WT",]


library(ggplot2)
library(plyr)
ggplot(wt_data, aes(x=WaterCondition, y=Temperature)) + 
  geom_boxplot() + geom_jitter()

ddply(.data = wt_data, .variables = c("WaterCondition"),
      .fun = function(x){summary(x$Temperature)})
##   WaterCondition Min. 1st Qu. Median    Mean 3rd Qu. Max.
## 1        Drought 35.8   37.00  38.85 38.5875   39.80 42.0
## 2         Normal 30.0   31.45  31.95 31.8500   32.25 33.5

There is no overlap at all between the temperatures under the two water conditions. Could this happen “by chance”?

To answer that we need, implicitly or explicitly, a model of the biological variation that generated these data.

A t-test? Sure, that works in this case.

t.test(x = wt_data[wt_data$WaterCondition == "Drought", "Temperature"],
       y= wt_data[wt_data$WaterCondition == "Normal", "Temperature"])
## 
##  Welch Two Sample t-test
## 
## data:  wt_data[wt_data$WaterCondition == "Drought", "Temperature"] and wt_data[wt_data$WaterCondition == "Normal", "Temperature"]
## t = 8.1716, df = 10.354, p-value = 7.85e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.908888 8.566112
## sample estimates:
## mean of x mean of y 
##   38.5875   31.8500

But what model is that? What does it assume?

A t-test is a linear model and can be written as:

response variable = baseline + difference × (treatment == 1) + noise, with the noise normally distributed and independent
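
For instance (a minimal sketch reusing wt_data from above), a t-test with equal variances assumed gives exactly the same t value and p-value as the corresponding lm() fit shown further down:

# equal-variance t-test: the same model as lm(Temperature ~ WaterCondition)
t.test(Temperature ~ WaterCondition, data = wt_data, var.equal = TRUE)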

What is a statistical model

What is a model?

For instance, a map is a model. It can have more or less detail, different types of information, and be useful for different goals. For example, using a world map of oceanic currents you could make sense of data about genetic structure in some algae species, because… but maybe you won’t explain everything, because algae also travel attached to boat hulls; so maybe you would need a map of major maritime routes too.

Digression on good and bad models

Candies practical: https://timotheenivalis.github.io/data/candies.mp4

Naive distribution from a model that assumes all items have the same probability of being picked:

# hypergeometric draws: 5 burgers (m), 5 fruit (n), 5 meals picked without replacement (k)
dat <- data.frame(burgers = rhyper(nn = 10000, m=5, n = 5, k = 5))

colfunc <- colorRampPalette(c("#FFA500", "#8B4726"))

ggplot(dat, aes(x=burgers)) + 
  geom_bar(fill = colfunc(6)) +
       xlab("Number of burgers")+
       ylab("Count") +
    scale_x_continuous(breaks = 0:5)+ theme_bw()+  theme(text = element_text(size=14))

Create a function that generates a distribution corresponding to a better model, where one type of item is more likely to be picked than the other.

fancyhyperg <- function(burger = 5, fruit = 5, meals = 5, relativeprob = 2)
{
  initial_burger <- burger
  for(i in 1:meals)
  {
    # draw one item without replacement; burgers weighted 'relativeprob' times more than fruit
    meal <- sample(x = c(rep("burger", times=burger), rep("fruit", times=fruit)),
                   size = 1, prob = c(rep(relativeprob, burger), rep(1, fruit)))
    if (meal == "burger") burger <- burger - 1 else fruit <- fruit - 1
  }
  nb_burger_meals <- initial_burger - burger
  return(nb_burger_meals)
}

Use this function to generate data where the burger items are 3 times more likely to be picked:

# prepend a 0 and a 1 so that every count from 0 to 5 appears at least once
# (keeps the six-colour fill below aligned with six bars)
x <- c(0,1,replicate(n = 10000, expr = fancyhyperg(relativeprob = 3)))

dat <- data.frame(burgers = x)

colfunc <- colorRampPalette(c("#FFA500", "#8B4726"))

ggplot(dat, aes(x=burgers)) + 
  geom_bar(fill = colfunc(6)) +
       xlab("Number of burgers")+
       ylab("Count") +
    scale_x_continuous(breaks = 0:5)+ theme_bw()+  theme(text = element_text(size=14))
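
As a quick sanity check (a small sketch reusing rhyper() and the fancyhyperg() function defined above), you can compare the average number of burger meals under the two models:

mean(rhyper(nn = 10000, m = 5, n = 5, k = 5))                     # close to 2.5 when both items are equally likely
mean(replicate(n = 10000, expr = fancyhyperg(relativeprob = 3)))  # noticeably above 2.5 when burgers are favoured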

“All stats are lm”: fitting linear models in R

You may have learned statistics with this type of flow chart. In fact, all of these models are linear models.

All of this jungle can be fitted as linear models (in a broad sense). In this course we will try to avoid specific tests and instead build models that correspond to our questions and our data. This approach is more flexible, more insightful, and more empowering in the long run (once you know the “syntax” of models you will be able to fit anything you need).

A statistical model consists of:

  1. A QUESTION, which defines the outcome measures (here: Temperature)
  2. Experimental factors, also defined by the QUESTION (here: WaterCondition)
  3. Blocking factors (here: plant identity)

The components you include, and the mathematical shape of the relationships between them, constitute the model assumptions.

ggplot(wt_data, aes(x=WaterCondition, y=Temperature)) + 
  geom_boxplot() + geom_jitter() +
  geom_smooth(method = "lm", 
      mapping = aes(x=as.integer(WaterCondition), y=Temperature), inherit.aes = FALSE)
## `geom_smooth()` using formula 'y ~ x'

m0 <- lm(Temperature ~ WaterCondition, data = wt_data)
summary(m0)
## 
## Call:
## lm(formula = Temperature ~ WaterCondition, data = wt_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7875 -0.9844  0.1000  1.0625  3.4125 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           38.5875     0.5830  66.187  < 2e-16 ***
## WaterConditionNormal  -6.7375     0.8245  -8.172 1.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.649 on 14 degrees of freedom
## Multiple R-squared:  0.8267, Adjusted R-squared:  0.8143 
## F-statistic: 66.78 on 1 and 14 DF,  p-value: 1.069e-06
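
To see what these coefficients mean (a quick check reusing m0): the intercept is the mean temperature under drought, and adding the WaterConditionNormal coefficient gives the mean under normal watering.

coef(m0)["(Intercept)"]  # mean Temperature under Drought (38.59)
sum(coef(m0))            # mean Temperature under Normal watering (31.85)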

The lm() above assumes the same residual variance under both water conditions. With glmmTMB you can relax that assumption and let the dispersion depend on the water condition, in the spirit of the Welch t-test used earlier:

library(glmmTMB)
m1 <- glmmTMB(Temperature ~ WaterCondition, 
              dispformula = ~ WaterCondition,
              data = wt_data)
summary(m1)
##  Family: gaussian  ( identity )
## Formula:          Temperature ~ WaterCondition
## Dispersion:                   ~WaterCondition
## Data: wt_data
## 
##      AIC      BIC   logLik deviance df.resid 
##     63.8     66.9    -27.9     55.8       12 
## 
## 
## Conditional model:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           38.5875     0.6884   56.05   <2e-16 ***
## WaterConditionNormal  -6.7375     0.7712   -8.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Dispersion model:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)            1.3327     0.5000   2.665  0.00769 **
## WaterConditionNormal  -1.3657     0.7071  -1.931  0.05344 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anatomy of lm()

Response = Intercept + Slope1 × Predictor1 + Slope2 × Predictor2 + Error, with the Error term following a normal distribution.

You can fit that model as:

lm(response ~ 1 + predictor1 + predictor2, data=data)

or equivalently as

lm(response ~ predictor1 + predictor2, data=data)

or equivalently as

lm(response ~ predictor2 + predictor1, data=data)
  • The intercept can be explicit or implicit
  • The intercept can be removed with … ~ 0 + …
  • The error term is implicit
  • Use the data= argument to keep code short, reliable and flexible
  • The order of predictors does not matter (see the sketch below)
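
A minimal sketch of these points on simulated data (the names predictor1, predictor2 and response are hypothetical, made up for illustration):

set.seed(1)
data <- data.frame(predictor1 = rnorm(50), predictor2 = rnorm(50))
data$response <- 1 + 0.5*data$predictor1 - 0.3*data$predictor2 + rnorm(50)

coef(lm(response ~ 1 + predictor1 + predictor2, data = data)) # explicit intercept
coef(lm(response ~ predictor2 + predictor1, data = data))     # same estimates, different order
coef(lm(response ~ 0 + predictor1 + predictor2, data = data)) # intercept removed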

Linear model assumptions and how to check them

  • Predictors not (almost) perfectly correlated. Not very dangerous, because it is quite obvious.

    Risk: Model won’t run, unstable convergence, or huge standard errors

dat <- data.frame(x1=1:10, x2=1:10, y= 1 + rnorm(10))
# x1 and x2 are perfectly collinear: lm() cannot separate them and drops one (NA coefficient)
lm(y ~ 1 + x1 + x2, data = dat)
## 
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = dat)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     0.99419      0.05897           NA
# x2 is x1 plus tiny noise: the fit runs but standard errors explode
dat <- data.frame(x1=1:10, x2=1:10+rnorm(10,0,0.00001), y= 1 + rnorm(10))
summary(lm(y ~ 1 + x1 + x2, data = dat))
## 
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1801 -0.4497 -0.1023  0.2947  1.1086 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.723e-01  5.641e-01   1.546    0.166
## x1          -3.176e+04  2.501e+04  -1.270    0.245
## x2           3.176e+04  2.501e+04   1.270    0.245
## 
## Residual standard error: 0.797 on 7 degrees of freedom
## Multiple R-squared:  0.1884, Adjusted R-squared:  -0.04344 
## F-statistic: 0.8127 on 2 and 7 DF,  p-value: 0.4815
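
A simple way to spot this before fitting (a minimal sketch using base R only) is to look at the correlation between the predictors:

cor(dat$x1, dat$x2)  # essentially 1: the two predictors carry the same information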

  • Little error in predictors. Most of the time not very dangerous, but it can sometimes really bias results. Not easy to spot unless you have replicated measurements to check how reliable your measurements are.

    Risk: Biased estimates (the effect is often underestimated when the predictor has Gaussian measurement error)

# simulation with the true, error-free predictor
dat <- data.frame(xtrue = 1:100, y= 0.5*(1:100) + rnorm(100))
summary(lm(y ~ xtrue, data = dat))
## 
## Call:
## lm(formula = y ~ xtrue, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26668 -0.52708 -0.01435  0.71256  2.39675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.488336   0.179510   -2.72  0.00772 ** 
## xtrue        0.507418   0.003086  164.42  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8908 on 98 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9964 
## F-statistic: 2.703e+04 on 1 and 98 DF,  p-value: < 2.2e-16
# small measurement error
dat$xwitherror <- dat$xtrue + rnorm(100, sd = 1)
summary(lm(y ~ xwitherror, data = dat))
## 
## Call:
## lm(formula = y ~ xwitherror, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57482 -0.67548  0.03332  0.65039  2.31306 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.485932   0.200957  -2.418   0.0174 *  
## xwitherror   0.506880   0.003452 146.848   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.997 on 98 degrees of freedom
## Multiple R-squared:  0.9955, Adjusted R-squared:  0.9954 
## F-statistic: 2.156e+04 on 1 and 98 DF,  p-value: < 2.2e-16
# strong measurement error
dat$xwitherror <- dat$xtrue + rnorm(100, sd = 10)
summary(lm(y ~ xwitherror, data = dat))
## 
## Call:
## lm(formula = y ~ xwitherror, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7178  -2.8604  -0.0738   3.3254  13.0353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.18447    0.92376   3.447 0.000836 ***
## xwitherror   0.43876    0.01562  28.092  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.927 on 98 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8884 
## F-statistic: 789.2 on 1 and 98 DF,  p-value: < 2.2e-16
# very strong measurement error: the effect is almost gone
dat$xwitherror <- dat$xtrue + rnorm(100, sd = 100)
summary(lm(y ~ xwitherror, data = dat))
## 
## Call:
## lm(formula = y ~ xwitherror, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.166 -11.203   1.336  11.610  28.240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.94030    1.63182   14.67   <2e-16 ***
## xwitherror   0.02068    0.01253    1.65    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.62 on 98 degrees of freedom
## Multiple R-squared:  0.02704,    Adjusted R-squared:  0.01711 
## F-statistic: 2.724 on 1 and 98 DF,  p-value: 0.1021

  • Gaussian error distribution. Easy to spot, but often not crucial: in my experience, small deviations from a Gaussian (a.k.a. normal) distribution do not matter much.

    Risk: Poor predictions

# simulation: the response is log-normal, not Gaussian, around the linear predictor
x <- rnorm(1000)
y <- exp(0.2 + 0.5*x + rnorm(1000,0,0.1))
dat <- data.frame(x=x, y=y)

# what you see: the QQ-plot of residuals (plot 2) bends away from the straight line
mexp <- lm(y ~ x, data = dat)
plot(mexp, which = 2)

  • Homoscedasticity (constant error variance). Quite dangerous, but easy to spot.

    Risk: Over-optimistic uncertainty, unreliable predictions

# simulation: the error variance grows with the mean
x <- rnorm(1000)
y <- exp(0.2 + 0.5*x + rnorm(1000,0,0.1))
dat <- data.frame(x=x, y=y)
plot(x,y)

# what you see: the scale-location plot (plot 3) shows the residual spread increasing with the fitted values
mexp <- lm(y ~ x, data = dat)
plot(mexp, which = 3)


  • Independence of noise among data points. That’s the most dangerous problem in my opinion: it is almost always present and sometimes difficult to identify and correct.

    Risk: Bias and over-optimistic uncertainty

If you are lucky, you can diagnose the problem by looking for bands in the residuals in the diagnostic plots:

# simulation: 25 groups of 40 observations share the same group-level deviation,
# so data points within a group are not independent
re <- rep(rnorm(n = 25, mean = 0, sd = 3), each=40)
x <- rnorm(1000)
y <- 0.2*x + re
plot(x,y)

# what you see: horizontal bands of residuals, one per group, in the diagnostic plots
mi <- lm(y ~ x)
plot(mi)