Linear models 1: interpretation, interactions, assumptions, multiple regression

Today we learn:

Linear models as a central tool for statistical inference and learning from data
Fit and interpret linear models in R
Understand additive effects and interactions
Understand multiple regression with correlated predictors
Assess model assumptions

Setting up

Packages for today:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(glmmTMB)
library(emmeans)
library(DHARMa)

## This is DHARMa 0.4.1. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa') Note: Syntax of plotResiduals has changed in 0.3.0, see ?plotResiduals for details

Data for today:

download.file(url = "https://timotheenivalis.github.io/workshops/Stats21/Data.zip",
              destfile = "Data.zip")

Linear models versus tests

You may have learned statistics with this type of flow charts:

Forget about the zoo of tests, lm() does it all

Linear models provide a very flexible way to express ideas about your data and test those ideas.

Once you understand the basics of linear models you can build on them to obtain:

Mixed effect models
Generalized linear models
GLMMs
Mixture models
Non-linear state-space models
Additive models, splines, smoothers
… and much more

Fitting simple linear models

General guidelines:

Define your scientific question(s)
- Response variable
- Experimental factors
- Blocking/confounding factors
Visualise your data in a way that answers your question
Fit corresponding model to put numbers on what you see
Check assumptions and think about how much you trust your results

First example: simple regression with 2 levels

Leaf temperature under different water conditions for two different genotypes.

drought_data <- read_csv("Data/droughtdata.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   plant = col_double(),
##   Genotype = col_character(),
##   WaterCondition = col_character(),
##   Temperature = col_double()
## )

drought_data

For now we just want ask whether temperature differs between genotypes in drought conditions.

Let’s subset the data:

drought_only <- drought_data %>% filter(WaterCondition=="Drought")

ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_boxplot()

Corresponding model:

m_drought <- glmmTMB(Temperature ~ Genotype, data = drought_only)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_drought)

##  Family: gaussian  ( identity )
## Formula:          Temperature ~ Genotype
## Data: drought_only
## 
##      AIC      BIC   logLik deviance df.resid 
##     71.7     74.0    -32.9     65.7       13 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 3.56 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  31.6875     0.6671   47.50  < 2e-16 ***
## GenotypeWT    6.9000     0.9434    7.31 2.59e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to read parameter estimates? Compare the mean in each genotype to the estimates:

drought_only %>% 
  group_by(Genotype) %>%
  summarise(mean(Temperature))

fixef(m_drought)$cond["(Intercept)"]

## (Intercept) 
##     31.6875

fixef(m_drought)$cond["(Intercept)"] + fixef(m_drought)$cond["GenotypeWT"]

## (Intercept) 
##     38.5875

So the parameter GenotypeWT corresponds to the difference in the means of GenotypeWT - GenotypeMutant.

You can obtain the predicted mean responses for each level, with standard error and confidence interval using the package emmeans:

emmeans(m_drought, ~ Genotype)

##  Genotype emmean    SE df lower.CL upper.CL
##  mutant     31.7 0.667 13     30.2     33.1
##  WT         38.6 0.667 13     37.1     40.0
## 
## Confidence level used: 0.95

How to read measures of statistical significance?

The standard error tells you about the uncertainty in the estimation. The true value of a parameter is expected to full within +/- 2 standard errors about 95% of the time.

z-value = Estimate / Std. Error and is used to compute the p-value.

The p-value ( Pr(>|z|) ) approximates the frequency you would see such a z-value if the true value of the parameter was exactly zero.

** How to visualise the results? **

drought_prediction <- as_tibble(emmeans(m_drought, ~ Genotype))

ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_boxplot() +
  geom_point(data=drought_prediction, aes(y = emmean), color="red")+
  geom_errorbar(data=drought_prediction, aes(x=Genotype, ymin = lower.CL, ymax=upper.CL),
                color="red", inherit.aes = FALSE)

The error bars show 95% ranges of plausible true mean temperature values (in a larger population); they do NOT predict the spread of points.

Making it a bit prettier:

ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_point() +
  geom_point(data=drought_prediction, aes(y = emmean), color="red", size=3)+
  geom_errorbar(data=drought_prediction, aes(x=Genotype, ymin = lower.CL, ymax=upper.CL),
                color="red", inherit.aes = FALSE, width=0.1) + theme_bw()

Additive and interactive models

Leaf temperature under different water conditions for two different genotypes. Let us now consider the full dataset.

drought_data <- read_csv("Data/droughtdata.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   plant = col_double(),
##   Genotype = col_character(),
##   WaterCondition = col_character(),
##   Temperature = col_double()
## )

drought_data

Do genotypes differ in their response to drought?

ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
  geom_boxplot()

m_additive <- glmmTMB(Temperature ~ Genotype + WaterCondition, data = drought_data)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_additive)

##  Family: gaussian  ( identity )
## Formula:          Temperature ~ Genotype + WaterCondition
## Data: drought_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    153.5    159.3    -72.7    145.5       28 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 5.52 
## 
## Conditional model:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           33.4688     0.7191   46.55  < 2e-16 ***
## GenotypeWT             3.3375     0.8303    4.02 5.83e-05 ***
## WaterConditionNormal  -3.1750     0.8303   -3.82 0.000131 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

emmeans(m_additive, ~ WaterCondition + Genotype)

##  WaterCondition Genotype emmean    SE df lower.CL upper.CL
##  Drought        mutant     33.5 0.719 28     32.0     34.9
##  Normal         mutant     30.3 0.719 28     28.8     31.8
##  Drought        WT         36.8 0.719 28     35.3     38.3
##  Normal         WT         33.6 0.719 28     32.2     35.1
## 
## Confidence level used: 0.95

This model tells is there is an overall difference among genotypes and an overall difference among water conditions, but that does not answer our question.

drought_data %>%
  group_by(WaterCondition, Genotype) %>%
  summarise(mean(Temperature))

## `summarise()` has grouped output by 'WaterCondition'. You can override using the `.groups` argument.

additive_prediction <- as_tibble(emmeans(m_additive, ~ WaterCondition + Genotype))

ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
  geom_jitter(alpha=0.3, width=0.1, height = 0) +
  geom_point(data=additive_prediction, aes(y=emmean), size=3) +
  geom_errorbar(data=additive_prediction,
                aes(ymin=lower.CL, ymax=upper.CL,x=WaterCondition, color=Genotype),
                width=0.1, inherit.aes = FALSE)

m_interaction <- glmmTMB(Temperature ~ Genotype * WaterCondition, data = drought_data)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_interaction)

##  Family: gaussian  ( identity )
## Formula:          Temperature ~ Genotype * WaterCondition
## Data: drought_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    128.0    135.4    -59.0    118.0       27 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 2.34 
## 
## Conditional model:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      31.6875     0.5411   58.56  < 2e-16 ***
## GenotypeWT                        6.9000     0.7652    9.02  < 2e-16 ***
## WaterConditionNormal              0.3875     0.7652    0.51    0.613    
## GenotypeWT:WaterConditionNormal  -7.1250     1.0822   -6.58 4.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

emmeans(m_interaction, ~ WaterCondition + Genotype)

##  WaterCondition Genotype emmean    SE df lower.CL upper.CL
##  Drought        mutant     31.7 0.541 27     30.6     32.8
##  Normal         mutant     32.1 0.541 27     31.0     33.2
##  Drought        WT         38.6 0.541 27     37.5     39.7
##  Normal         WT         31.9 0.541 27     30.7     33.0
## 
## Confidence level used: 0.95

The parameter GenotypeWT:WaterConditionNormal tests for an interaction.

You can recover means by hand:

#Mutant in drought
fixef(m_interaction)$cond["(Intercept)"]

## (Intercept) 
##     31.6875

#WT in drought
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["GenotypeWT"]

## (Intercept) 
##     38.5875

#Mutant in normal
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["WaterConditionNormal"]

## (Intercept) 
##      32.075

#WT in normal
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["GenotypeWT"] +
  fixef(m_interaction)$cond["WaterConditionNormal"] + fixef(m_interaction)$cond["GenotypeWT:WaterConditionNormal"]

## (Intercept) 
##       31.85

interaction_prediction <- as_tibble(emmeans(m_interaction, ~ WaterCondition + Genotype))

ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
  geom_jitter(alpha=0.3, width=0.1, height =0) +
  geom_point(data=interaction_prediction, aes(y=emmean), size=3) +
  geom_errorbar(data=interaction_prediction,
                aes(ymin=lower.CL, ymax=upper.CL,x=WaterCondition, color=Genotype),
                width=0.1, inherit.aes = FALSE)

Don’t forget to assess model assumptions

What are linear model assumptions? Let’s keep it simple for today. Let’s just say:

Residuals are randomly distributed, approximately following a normal distribution, with the same variance everywhere.

If the assumptions are not met we know that there may be issues with the estimation of parameters, and especially with the calculation of standard errors and p-values. The model answers are not necessarily wrong, but maybe we can do better, we need to be a bit careful and it would be good to understand where deviations from assumptions come from and whether that is a bad problem.

If the assumptions are met, we can relax a bit more, although there can always be fundamental problems that are difficult to detect.

Visual inspection of predictions and raw data

Looking at our ggplot with raw data and predictions we can see:

The model predicts the mean values well
Raw data seem spread homogenously and randomly around the means
No clear outlier or anything unusual

More formal assesment using DHARMa

simres <- simulateResiduals(m_interaction)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

plot(simres)

All looks good here.

DHARMa may look like an overkill here, but it works well for more complex models, so why not start using it now?

A multiple regression is not multiple regressions

A silly example

How far can kids jump depending on their mass?

jumping <- read_csv("Data/jumpingdistance.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   jump = col_double(),
##   mass = col_double(),
##   height = col_double()
## )

jumping

m_mass <- glmmTMB(jump ~ mass, data=jumping)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_mass)

##  Family: gaussian  ( identity )
## Formula:          jump ~ mass
## Data: jumping
## 
##      AIC      BIC   logLik deviance df.resid 
##   2998.1   3012.8  -1496.0   2992.1      997 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 1.17 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.474112   0.237029   6.219    5e-10 ***
## mass        0.040010   0.004271   9.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m_height <- glmmTMB(jump ~ height, data=jumping)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_mass)

##  Family: gaussian  ( identity )
## Formula:          jump ~ mass
## Data: jumping
## 
##      AIC      BIC   logLik deviance df.resid 
##   2998.1   3012.8  -1496.0   2992.1      997 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 1.17 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 1.474112   0.237029   6.219    5e-10 ***
## mass        0.040010   0.004271   9.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(jumping, aes(x=mass, y=jump)) + geom_point()

m_massheight <- glmmTMB(jump ~ mass + height, data=jumping)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m_massheight)

##  Family: gaussian  ( identity )
## Formula:          jump ~ mass + height
## Data: jumping
## 
##      AIC      BIC   logLik deviance df.resid 
##   2931.6   2951.2  -1461.8   2923.6      996 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 1.09 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.446438   0.739727  -6.011 1.84e-09 ***
## mass        -0.019594   0.008196  -2.391   0.0168 *  
## height       5.566775   0.661342   8.417  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The direct (causal) effect of mass is negative, as revealed by a multiple regression.

Parameter estimates often change value or even sign when you include or remove other predictors.

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

plot_ly(x=jumping$mass, y=jumping$height, z=jumping$jump, 
        type="scatter3d", mode="markers", 
        color = jumping$jump, alpha = 0.8) %>%
    layout(
    title = "Layout options in a 3d scatter plot",
    scene = list(
      xaxis = list(title = "mass"),
      yaxis = list(title = "height"),
      zaxis = list(title = "jumping distance")
    ))

Multiple regression real example

Snow vole (that is a rodent) data. What explains variation in body mass?

voles <- read_csv("Data/voles.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   mass = col_double(),
##   body_length = col_double(),
##   tail_length = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   year = col_double(),
##   x_coord = col_double(),
##   y_coord = col_double(),
##   day = col_double(),
##   inbreeding = col_double()
## )

Are adult voles heavier than juveniles?

ggplot(voles, aes(x=age, y=mass)) + geom_violin() + geom_jitter(width = 0.1, height = 0)

## Warning: Removed 114 rows containing non-finite values (stat_ydensity).

## Warning: Removed 114 rows containing missing values (geom_point).

m0 <- glmmTMB(mass ~ 1 + age, data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m0)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + age
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  17976.8  17994.7  -8985.4  17970.8     2906 
## 
## 
## Dispersion estimate for gaussian family (sigma^2): 28.2 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  42.0208     0.1295   324.6   <2e-16 ***
## ageJ        -15.9849     0.1994   -80.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes, on average, 16g heavier.

But, are they heavier given their body length?

ggplot(voles, aes(x=body_length, y=mass, color=age)) + geom_point()

## Warning: Removed 786 rows containing missing values (geom_point).

m1 <- glmmTMB(mass ~ 1 + body_length + age, data=voles)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

summary(m1)

##  Family: gaussian  ( identity )
## Formula:          mass ~ 1 + body_length + age
## Data: voles
## 
##      AIC      BIC   logLik deviance df.resid 
##  12562.7  12585.6  -6277.4  12554.7     2233 
## 
## 
## Dispersion estimate for gaussian family (sigma^2):   16 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -27.58046    1.62062  -17.02   <2e-16 ***
## body_length   0.59747    0.01389   43.03   <2e-16 ***
## ageJ         -5.07780    0.30696  -16.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given their length, adult voles are just 5g heavier than juveniles. But juveniles are also shorter, so overall the mass difference appears larger if you do not account for body length.

Which approach is correct? Both, it depends what your question is!

By the way, age and body length are highly correlated:

cor(1-as.integer(as.factor(voles$age)),
    voles$body_length, use = "pairwise.complete.obs")

## [1] 0.8265616

Contrary to what you may have heard, high correlation between predictors is not a problem in linear models. But it is important to know about the correlation structure of predictors to interpret the results correctly.

Evaluate the model

sim_m1 <- simulateResiduals(m1)

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

plot(sim_m1)

Reasons for the lack of fit?

pred_voles <- as_tibble(emmeans(m1, spec = ~ age + body_length,
        at=list(body_length=seq(min(voles$body_length, na.rm = TRUE),
                                max(voles$body_length, na.rm = TRUE)))))

ggplot(voles, aes(x=body_length, y=mass, color=age)) + geom_point() + 
  geom_line(data = pred_voles, aes(y=emmean), size=2)

## Warning: Removed 786 rows containing missing values (geom_point).

Probably non-linear scaling of masss with body length, interaction age-body length, probably heterogeneity not accounted for.

Linear model syntax in brief

For instance, if you have 2 predictors, a simple linear model can be written as:

Response = Intercept + Estimate1 × Predictor1+ Estimate2 × Predictor2 + Error, with Error distributed following a random normal distribution.

or more mathematically as:

\[ y_i = \mu + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i \text{, } \epsilon \sim N(0, V_R) \]

You can fit that model as:

glmmTMB(response ~ 1 + predictor1 + predictor2, data=data)

or equivalently as:

glmmTMB(response ~ predictor1 + predictor2, data=data)