Linear models 2: random effects and variance structure

Today we learn:

Thinking about the structure of residual variance is an integral part of critical thinking:
- Risks of biased inference (false positive or false negative)
- Risks of incorrect uncertainty (over-6 or under-confident)
Random effects are a powerful way to capture unexplained patterns and build more robust inference
How to interpret mixed effect model output
How to visualise mixed effect model results
How to test the significance of random effects

Setting up

Packages for today:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(glmmTMB)
library(emmeans)
library(DHARMa)

## This is DHARMa 0.4.1. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa') Note: Syntax of plotResiduals has changed in 0.3.0, see ?plotResiduals for details

Data for today:

download.file(url = "https://timotheenivalis.github.io/workshops/Stats21/Data.zip",
              destfile = "Data.zip")

unzip(zipfile = "Data.zip")

Reminder

General guidelines:

Define your scientific question(s)
- Response variable
- Experimental factors
- Blocking/confounding factors
Visualise your data in a way that answers your question
Fit corresponding model to put numbers on what you see
Check assumptions and think about how much you trust your results

Looking again at drought data

Last week we found that different genotypes responded differently to water conditions, that is, there was an interaction between genotype and water condition.

(NB: I have tweaked the dataset a little to make the patterns a bit clearer; so don’t worry if you see slight changes in the plots compared to last week!)

drought_data <- read_csv("Data/droughtdata2.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   plant = col_double(),
##   Genotype = col_character(),
##   WaterCondition = col_character(),
##   Temperature = col_double()
## )

ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
  geom_boxplot()

m_inter <- glmmTMB(Temperature ~ Genotype * WaterCondition, data = drought_data)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_inter)

##  Family: gaussian  ( identity )
## Formula:          Temperature ~ Genotype * WaterCondition
## Data: drought_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    141.5    148.8    -65.7    131.5       27 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 3.56 
## 
## Conditional model:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      32.3875     0.6672   48.54  < 2e-16 ***
## GenotypeWT                        6.0375     0.9435    6.40 1.57e-10 ***
## WaterConditionNormal             -0.2250     0.9435   -0.24    0.812    
## GenotypeWT:WaterConditionNormal  -6.3125     1.3343   -4.73 2.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

inter_prediction <- as_tibble(emmeans(m_inter, ~ WaterCondition * Genotype))

ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
  geom_jitter(alpha=0.3, width=0.1, height = 0) +
  geom_point(data=inter_prediction, aes(y=emmean), size=3, position = position_dodge(width = 1)) +
  geom_errorbar(data=inter_prediction,
                aes(ymin=lower.CL, ymax=upper.CL,x=WaterCondition, color=Genotype),
                width=0.1, inherit.aes = FALSE, position = position_dodge(width = 1))

We ignored one thing though: temperature was measured on leaves from the same plants. For each treatment we have 8 data points but have only use two plants. This exposes us to a potential problem:

Statistically, we have assumed the residuals were independent of each others; that is, there was no structure in them. This assumption is likely wrong because measurements from the same plant are likely more similar to each others that to measurements taken on different plants. Our inference may or may not be invalid because of this assumption being violated.
You can rephrase the statistical problem as a problem of logic or rigour: how much can we be sure an effect is general if we see it only in two individuals?

ggplot(drought_data, aes(x=interaction(Genotype, WaterCondition),
    y=Temperature, color=as.factor(plant)))+
  geom_point()+ 
  xlab("Genotype-by-exposure") + scale_color_brewer(palette="Dark2")

We do not really know what properties of the plants (apart from genotype and watering condition) may produce differences in temperature, and we have not measured much else about them. That is a typical case when to use a random effect to model individual differences.

A random intercept effect of plant asks the model to partition the residual variance into differences among plants and differences within plants. If plants are different from each other on average then we need to account for it to obtain a result that we may generalise to other plants.

m_inter_RE <- glmmTMB(Temperature ~ 1 + Genotype*WaterCondition + (1|plant), data = drought_data)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_inter_RE)

##  Family: gaussian  ( identity )
## Formula:          Temperature ~ 1 + Genotype * WaterCondition + (1 | plant)
## Data: drought_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    107.0    115.8    -47.5     95.0       26 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  plant    (Intercept) 3.0509   1.7467  
##  Residual             0.5101   0.7142  
## Number of obs: 32, groups:  plant, 8
## 
## Dispersion estimate for gaussian family (sigma^2): 0.51 
## 
## Conditional model:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       32.388      1.261  25.691  < 2e-16 ***
## GenotypeWT                         6.037      1.783   3.387 0.000708 ***
## WaterConditionNormal              -0.225      1.783  -0.126 0.899569    
## GenotypeWT:WaterConditionNormal   -6.312      2.521  -2.504 0.012290 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check the model correctly captured differences in average plant temperatures by looking at estimated plant deviations (they are called BLUPs = Best Linear Unbiased Predictions):

ranef(m_inter_RE)$cond$plant

The interaction is still apparent but there is much more uncertainty about its strength: the SE and p-value are much larger. The variance captured by plant is also very large: 1.75, vs. 0.71 for the residual. On average there is more differences among plants mean temperatures than within plant measurements.

Let’s visualise the different predictions made by the models with and without random effects (the code is not great, don’t worry to much about it, the point is just to create a plot to illustrate the statistical patter):

inter_RE_prediction <- as_tibble(emmeans(m_inter_RE, ~ WaterCondition * Genotype))

inter_RE_prediction$RandomEffect <- TRUE
inter_prediction$RandomEffect <- FALSE

predictionsRE <- rbind(inter_prediction, inter_RE_prediction)

ggplot(drought_data, aes(x=interaction(Genotype, WaterCondition),
    y=Temperature, color=as.factor(plant))) +
  geom_jitter(alpha=0.8, width=0.1, height = 0) +
  geom_point(data=predictionsRE, aes(y=emmean,x=interaction(Genotype, WaterCondition), stre=RandomEffect),
             size=3, inherit.aes = FALSE, position = position_dodge(width = 1))+
  geom_errorbar(data=predictionsRE,
                aes(ymin=lower.CL, ymax=upper.CL, x=interaction(Genotype, WaterCondition), stre=RandomEffect),
                width=0.1, inherit.aes = FALSE, position = position_dodge(width = 1)) +
  xlab("Genotype-by-exposure") + scale_color_brewer(palette="Dark2")

## Warning: Ignoring unknown aesthetics: stre

## Warning: Ignoring unknown aesthetics: stre

Here including a random effect changes very little the predicted means, but changes considerably the error bars and the p-value. We still find evidence of an interaction but we have much less confidence about how strong it is. That makes sense, after all our experiment considers only 2 plants per treatment! It is very possible these two plants are not representative!

To convince ourselves the model captures variation well, we can also plot plant specific predictions, but we need the predict function for that:

newdatadrought <- drought_data[!duplicated(drought_data$plant),]

#make a prediction for each random effect level ("conditional"):
preoutput <- predict(m_inter_RE, newdata =newdatadrought, 
        se.fit = TRUE, re.form = NULL)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

newdatadrought$Temperature <- preoutput$fit
newdatadrought$lower.CL <- preoutput$fit - 2*preoutput$se.fit
newdatadrought$upper.CL <- preoutput$fit + 2*preoutput$se.fit

#reproduce the emmeans result by averaging across the random effect level ("marginal"):
predict(m_inter_RE, newdata = expand_grid(WaterCondition=unique(drought_data$WaterCondition), 
                                          Genotype=unique(drought_data$Genotype),
                                          plant=NA), se.fit = TRUE, re.form = NA)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## $fit
## mu_predict mu_predict mu_predict mu_predict 
##    31.8875    32.1625    38.4250    32.3875 
## 
## $se.fit
## mu_predict mu_predict mu_predict mu_predict 
##   1.260634   1.260634   1.260634   1.260634

ggplot(drought_data, aes(x=interaction(Genotype, WaterCondition),
    y=Temperature, color=as.factor(plant))) +
  geom_jitter(alpha=0.8, width=0.1, height = 0) +
  geom_point(data=predictionsRE, aes(y=emmean,x=interaction(Genotype, WaterCondition), stre=RandomEffect),
             size=3, inherit.aes = FALSE, position = position_dodge(width = 1))+
  geom_errorbar(data=predictionsRE,
                aes(ymin=lower.CL, ymax=upper.CL, x=interaction(Genotype, WaterCondition), stre=RandomEffect),
                width=0.1, inherit.aes = FALSE, position = position_dodge(width = 1)) +
  geom_point(data=newdatadrought, shape=17, size=3, aes(x=interaction(Genotype, WaterCondition), 
                                                        y=Temperature,  color=as.factor(plant)), inherit.aes=FALSE)+ 
  geom_errorbar(data=newdatadrought, aes(x=interaction(Genotype, WaterCondition), 
                                                        ymin=lower.CL, ymax=upper.CL,  color=as.factor(plant)), inherit.aes=FALSE, width=0.1)+ 
  xlab("Genotype-by-exposure") + scale_color_brewer(palette="Dark2")

## Warning: Ignoring unknown aesthetics: stre

## Warning: Ignoring unknown aesthetics: stre

Including a random effect shows us that we are very confident about the average value for each plant, but not for the average value for a broader unobserved population.

By the way, this final model shows some small problems with residual distributions:

plot(simulateResiduals(m_inter_RE))

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

It may be that there is more variation in drought conditions than in normal conditions. Perhaps we could still improve the model by accounting for unequal variances, but given our small sample size that would probably be difficult. Here I think the next thing to do would rather be to collect more data, across more plants.

Thorns and Simpson paradox

Imagine you have measured how much a plant species is attacked by big herbivores as a function of how many thorns the plant grows on stems. You have collected data in 5 locations and visualise the relationship between herbivory and quantity of thorns

thorndata <- read_csv("Data/thorndata.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   herbivory = col_double(),
##   thorndensity = col_double(),
##   site = col_character()
## )

ggplot(aes(x=thorndensity, y= herbivory), data = thorndata) +  geom_point()+
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Counter-intuitive relationship.

m0 <- glmmTMB(herbivory ~thorndensity , data=thorndata)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m0)

##  Family: gaussian  ( identity )
## Formula:          herbivory ~ thorndensity
## Data: thorndata
## 
##      AIC      BIC   logLik deviance df.resid 
##    223.5    231.3   -108.7    217.5       97 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 0.515 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.25642    0.28249  11.528  < 2e-16 ***
## thorndensity  0.25237    0.07098   3.556 0.000377 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Counter-intuitive result is confirmed.

Residual checks:

plot(simulateResiduals(m0))

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

No significant deviation, but patterns apparent in residuals.

It turns out, the data were collected across five locations!

ggplot(aes(x=thorndensity, y= herbivory, color=site), data = thorndata) +  geom_point()+
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

There is a very strong pattern of differences among sites in both the predictor and the response. We don’t really know why, we don’t really care about the sites (we don’t have hypotheses about why sites differ), so we can probably model site effectively with a random effect:

mm1 <- glmmTMB(herbivory ~  thorndensity + (1|site), data=thorndata)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(mm1)

##  Family: gaussian  ( identity )
## Formula:          herbivory ~ thorndensity + (1 | site)
## Data: thorndata
## 
##      AIC      BIC   logLik deviance df.resid 
##    172.1    182.5    -82.0    164.1       96 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  site     (Intercept) 1.6477   1.2836  
##  Residual             0.2358   0.4856  
## Number of obs: 100, groups:  site, 5
## 
## Dispersion estimate for gaussian family (sigma^2): 0.236 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    7.7080     0.8014   9.618  < 2e-16 ***
## thorndensity  -0.9041     0.1447  -6.247  4.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now the relationship makes sense!

Let’s visualise model predictions:

pred_mm1 <- as_tibble(
  emmeans(
    mm1, ~ thorndensity, at=list(
      thorndensity=seq(min(thorndata$thorndensity), max(thorndata$thorndensity), length.out=100)
                                                        )
                              )
                      )

ggplot(aes(x=thorndensity, y= herbivory, color=site), data = thorndata) +
  geom_point()+
  geom_line(data = pred_mm1, aes(y=emmean, x=thorndensity), inherit.aes = FALSE)+
  geom_ribbon(data = pred_mm1, aes(ymin=lower.CL, ymax=upper.CL, x=thorndensity), inherit.aes = FALSE, alpha=0.1)

emmeans is not great with random effects because it can show only the relationship across sites (as if we did not know where we observe the data). The uncertainty includes variance among random effects.

To work with mixed models predict() may be better, because we can see the relationship within each site:

newdata <- expand_grid(thorndensity = seq(min(thorndata$thorndensity), max(thorndata$thorndensity), length.out=100),
                        site=unique(thorndata$site))
predoutput <- predict(mm1, newdata = newdata, se.fit = TRUE)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

newdata$herbivory <- predoutput$fit
newdata$lower.CL <- predoutput$fit - 2*predoutput$se.fit
newdata$upper.CL <- predoutput$fit + 2*predoutput$se.fit

ggplot(aes(x=thorndensity, y= herbivory, color=site), data = thorndata) +
  geom_point()+
  geom_line(data=newdata) + 
  geom_ribbon(data = newdata, aes(x=thorndensity, ymin=lower.CL, ymax=upper.CL, fill=site), inherit.aes = FALSE, alpha=0.1)

NB: How to choose between fixed vs. random effect?

First, if the predictor is a numerical variable (like size, temperature, number of species) you will generally want a fixed effect, because the order and distance between values is likely meaningful. A random effect consider values as random “names” for grouping levels.
In general it does not change inference much. Random effects are slightly more efficient because the differences between grouping levels are assumed to come from a normal distribution, instead of being estimated independently from each other.
A slightly lame but practical reason to use random effects: models output are cleaner with random effects because you get a single parameter estimate for a grouping variable modeled as a random effect, instead of a parameter estimate for each level of the grouping variable when modeled with a fixed effect.
Using a random effect instead of a fixed effect shifts the focus from the effect of each grouping level to variation among grouping levels. What is more interesting to you?
Often we use random effect to “correct” for some structure in the data that is not of interest. But random variance parameters can be of interest in themselves. For instance they can quantify genetic variation, individual repeatability, niche specialisationif a variance parameter is of interest then estimate it using a random effect.
On the other hand, if you are very interested in how different a particular grouping level is from other levels, use fixed effects.
Be careful with random effects if the number of grouping levels is small. You definitely need at least 3, probably at least 5, maybe at least 10With few grouping levels the fitting algorithm could have difficulties estimating the random variance, leading to unstable results (you may get different results from different algorithms) or the random variance being estimated to exactly zero.

Vole example

Snow vole (that is a rodent) data. What explains variation in body mass?

voles <- read_csv("Data/voles.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   mass = col_double(),
##   body_length = col_double(),
##   tail_length = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   year = col_double(),
##   x_coord = col_double(),
##   y_coord = col_double(),
##   day = col_double(),
##   inbreeding = col_double()
## )

voles <- na.omit(voles)

Last week we got this model:

m_vole1 <- glmmTMB(mass ~ 1 + body_length + age, data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_vole1)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + body_length + age
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  12461.7  12484.6  -6226.9  12453.7     2215 
## 
## 
## Dispersion estimate for gaussian family (sigma^2):   16 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -27.64520    1.62558  -17.01   <2e-16 ***
## body_length   0.59794    0.01393   42.92   <2e-16 ***
## ageJ         -5.06032    0.30768  -16.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pred_voles1 <- as_tibble(emmeans(m_vole1, spec = ~ age + body_length,
        at=list(body_length=seq(min(voles$body_length, na.rm = TRUE),
                                max(voles$body_length, na.rm = TRUE)))))

pred_voles1  <- pred_voles1 %>% filter((age=="A" & body_length>100 ) | (age=="J" & body_length<120) )

ggplot(voles, aes(x=body_length, y=mass, color=age)) + geom_point() + 
  geom_line(data = pred_voles1, aes(y=emmean), size=1) +
  geom_ribbon(data = pred_voles1, aes(ymin=lower.CL, ymax=upper.CL, x=body_length, fill=age),
              inherit.aes = FALSE, alpha=0.3)

That model is maybe not reliable because it omits several sources of non-independence between residuals. In particular we measured the same individuals repeatedly and we measured individuals across several years.

voles$resid1 <- residuals(m_vole1)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

voles %>%
  group_by(year) %>%
  summarise(mean(resid1, na.rm=TRUE))

voles %>%
  group_by(id) %>%
  summarise(indmean=mean(resid1, na.rm=TRUE)) %>%
  ggplot(aes(x=indmean)) + geom_density()

voles %>%
  group_by(id) %>%
  summarise(indmean=mean(resid1, na.rm=TRUE), yearmean=mean(year)) %>%
  ggplot(aes(x=yearmean, y=indmean)) + geom_point()

We can try to account for those sources of structure with random effects:

m_voles2 <- glmmTMB(mass ~ 1 + body_length + age + (1|id) + (1|year), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_voles2)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + body_length + age + (1 | id) + (1 | year)
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  12108.8  12143.0  -6048.4  12096.8     2213 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept)  4.2607  2.0641  
##  year     (Intercept)  0.4321  0.6573  
##  Residual             10.9173  3.3041  
## Number of obs: 2219, groups:  id, 677; year, 7
## 
## Dispersion estimate for gaussian family (sigma^2): 10.9 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -19.64876    1.65228  -11.89   <2e-16 ***
## body_length   0.52935    0.01404   37.70   <2e-16 ***
## ageJ         -6.25393    0.31801  -19.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The parameter estimates and standard errors changed very little, but the variance components seem quite large, so we are capturing residual variation.

Does it mean that on some years voles were on average heavier (given their length)? And that some individuals are consistently heavier through measurements?

To test these ideas we need to learn new tools to work with uncertainty in random effect parameters.

Tests and uncertainty for random effects

Likelihood Ratio Test (LRT): comparing a model with and one without the random effect of interest:

m_voles2_minusYear <-  glmmTMB(mass ~ 1 + body_length + age + (1|id), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

anova(m_voles2, m_voles2_minusYear)

m_voles2_minusId <-  glmmTMB(mass ~ 1 + body_length + age + (1|year), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

anova(m_voles2, m_voles2_minusId)

More visualisation

Year effects

newdata2 <- expand_grid(body_length=seq(min(voles$body_length, na.rm = TRUE),
                                max(voles$body_length, na.rm = TRUE)), age=c("A", "J"),
                        year=unique(voles$year), id=voles$id[1])
predoutput <- predict(m_voles2, newdata = newdata2, se.fit = TRUE, re.form = NULL)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

newdata2$mass <- predoutput$fit
newdata2$lower.CL <- predoutput$fit - 2* predoutput$se.fit
newdata2$upper.CL <- predoutput$fit
  
newdata2  <- newdata2 %>% filter((age=="A" & body_length>100 ) | (age=="J" & body_length<120) )

ggplot(voles, aes(x=body_length, y=mass, color=as.factor(year)) ) + geom_point() + 
  geom_line(data = newdata2, aes(x=body_length, color=as.factor(year)) ) + facet_wrap(~age, scales = "free") #+

  #geom_ribbon(data = newdata2, aes(ymin=lower.CL, ymax=upper.CL, x=body_length, fill=as.factor(year)),
   #           inherit.aes = FALSE, alpha=0.1)

newdata2b <- expand_grid(body_length=seq(min(voles$body_length, na.rm = TRUE),
                                max(voles$body_length, na.rm = TRUE)), age=c("A", "J"),
                        year=voles$year[1], id=unique(voles$id))
newdata2b$mass <- predict(m_voles2, newdata = newdata2b, re.form = NULL)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

newdata2b  <- newdata2b %>% filter((age=="A" & body_length>100 ) | (age=="J" & body_length<120) )

ggplot(voles, aes(x=body_length, y=mass) ) + geom_point() + 
  geom_line(data = newdata2b, aes(x=body_length, color=id), alpha=0.1 ) +
  facet_wrap(~age, scales = "free") + 
  theme(legend.position = "none")

If time allows, more complex random effects

Does sex-difference varies among years?

ggplot(voles, aes(x=body_length, y=mass, color=interaction(age,sex)) ) +
  geom_point(alpha=0.1)+ 
  geom_smooth(method = "lm") +
  facet_wrap(~year)

## `geom_smooth()` using formula 'y ~ x'

It looks pretty consistent.

m_voles3 <- glmmTMB(mass ~ 1 + body_length + sex + age + (1|id) + (1|year), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_voles3)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + body_length + sex + age + (1 | id) + (1 | year)
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  12074.2  12114.2  -6030.1  12060.2     2212 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept)  3.7521  1.937   
##  year     (Intercept)  0.4693  0.685   
##  Residual             10.9278  3.306   
## Number of obs: 2219, groups:  id, 677; year, 7
## 
## Dispersion estimate for gaussian family (sigma^2): 10.9 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -19.26855    1.63188  -11.81  < 2e-16 ***
## body_length   0.52057    0.01394   37.34  < 2e-16 ***
## sexMale       1.40825    0.22843    6.17 7.05e-10 ***
## ageJ         -6.55243    0.31763  -20.63  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m_voles4 <- glmmTMB(mass ~ 1 + body_length + sex + age + (1|id) + (1 + sex|year), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')

summary(m_voles4)

##  Family: gaussian  ( identity )
## Formula:          
## mass ~ 1 + body_length + sex + age + (1 | id) + (1 + sex | year)
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##       NA       NA       NA       NA     2210 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance  Std.Dev. Corr  
##  id       (Intercept)  3.759775 1.93901        
##  year     (Intercept)  0.525332 0.72480        
##           sexMale      0.007707 0.08779  -1.00 
##  Residual             10.922850 3.30497        
## Number of obs: 2219, groups:  id, 677; year, 7
## 
## Dispersion estimate for gaussian family (sigma^2): 10.9 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -19.23745    1.63562  -11.76  < 2e-16 ***
## body_length   0.52031    0.01395   37.30  < 2e-16 ***
## sexMale       1.40432    0.23125    6.07 1.26e-09 ***
## ageJ         -6.56908    0.31951  -20.56  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m_voles3, m_voles4)

What about the age difference?

m_voles5 <- glmmTMB(mass ~ 1 + body_length + age + (1|id) + (1 + age|year), data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_voles5)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + body_length + age + (1 | id) + (1 + age | year)
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  12098.9  12144.6  -6041.5  12082.9     2211 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev. Corr  
##  id       (Intercept)  4.2942  2.0722         
##  year     (Intercept)  0.7571  0.8701         
##           ageJ         1.0087  1.0043   -0.69 
##  Residual             10.7667  3.2813         
## Number of obs: 2219, groups:  id, 677; year, 7
## 
## Dispersion estimate for gaussian family (sigma^2): 10.8 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -19.03344    1.66968  -11.40   <2e-16 ***
## body_length   0.52375    0.01408   37.20   <2e-16 ***
## ageJ         -6.40091    0.50600  -12.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m_voles5, m_voles2)

Let’s try to visualise that:

newdata5 <- expand_grid(body_length = seq(70, 135), age=c("A", "J"),
                        year=unique(voles$year), id=voles$id[1])

pred5_out <- predict(m_voles5, newdata = newdata5, se.fit = TRUE)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

newdata5$mass <- pred5_out$fit

ggplot(voles, aes(x=body_length, y=mass, color=age) ) +
  geom_point(alpha=0.4)+
  geom_line(data=newdata5, mapping = aes(x=body_length, y=mass, color=age, by=as.factor(year)) )

## Warning: Ignoring unknown aesthetics: by

ggplot(voles, aes(x=body_length, y=mass) ) +
  geom_point(alpha=0.4)+
  geom_line(data=newdata5, mapping = aes(x=body_length, y=mass, color=as.factor(year)),
            inherit.aes = FALSE) + facet_wrap(~age)

REyears <- ranef(m_voles5)$cond$year
REyears$totalyears <- REyears$ageJ+REyears$`(Intercept)`

Linear mixed model syntax in brief

For instance, if you have 2 fixed effect predictors and 1 random intercept effect, a simple linear mixed model can be written as:

Response = Intercept + Estimate1 × Predictor1+ Estimate2 × Predictor2 + Random deviation+ Error, with Random deviation and Error distributed both following a random normal distribution.

or more mathematically as:

\[ y_i = \mu + \beta_1 x_{1,i} + \beta_2 x_{2,i} + u_i + \epsilon_i \text{ , with } u \sim N(0, V_U) \text{ and } \epsilon \sim N(0, V_R) \]

You can fit that model as:

glmmTMB(response ~ 1 + predictor1 + predictor2 + (1|randomeffect), data=data)

or equivalently as:

glmmTMB(response ~ predictor1 + predictor2 + (1|randomeffect), data=data)