Today we learn:
Packages for today:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(glmmTMB)
library(emmeans)
library(DHARMa)
## This is DHARMa 0.4.1. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa') Note: Syntax of plotResiduals has changed in 0.3.0, see ?plotResiduals for details
Data for today:
download.file(url = "https://timotheenivalis.github.io/workshops/Stats21/Data.zip",
destfile = "Data.zip")
You may have learned statistics with this type of flow charts:
Linear models provide a very flexible way to express ideas about your data and test those ideas.
Once you understand the basics of linear models you can build on them to obtain:
General guidelines:
Leaf temperature under different water conditions for two different genotypes.
drought_data <- read_csv("Data/droughtdata.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## plant = col_double(),
## Genotype = col_character(),
## WaterCondition = col_character(),
## Temperature = col_double()
## )
drought_data
For now we just want ask whether temperature differs between genotypes in drought conditions.
Let’s subset the data:
drought_only <- drought_data %>% filter(WaterCondition=="Drought")
ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_boxplot()
Corresponding model:
m_drought <- glmmTMB(Temperature ~ Genotype, data = drought_only)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_drought)
## Family: gaussian ( identity )
## Formula: Temperature ~ Genotype
## Data: drought_only
##
## AIC BIC logLik deviance df.resid
## 71.7 74.0 -32.9 65.7 13
##
##
## Dispersion estimate for gaussian family (sigma^2): 3.56
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 31.6875 0.6671 47.50 < 2e-16 ***
## GenotypeWT 6.9000 0.9434 7.31 2.59e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How to read parameter estimates? Compare the mean in each genotype to the estimates:
drought_only %>%
group_by(Genotype) %>%
summarise(mean(Temperature))
fixef(m_drought)$cond["(Intercept)"]
## (Intercept)
## 31.6875
fixef(m_drought)$cond["(Intercept)"] + fixef(m_drought)$cond["GenotypeWT"]
## (Intercept)
## 38.5875
So the parameter GenotypeWT
corresponds to the difference in the means of GenotypeWT - GenotypeMutant.
You can obtain the predicted mean responses for each level, with standard error and confidence interval using the package emmeans:
emmeans(m_drought, ~ Genotype)
## Genotype emmean SE df lower.CL upper.CL
## mutant 31.7 0.667 13 30.2 33.1
## WT 38.6 0.667 13 37.1 40.0
##
## Confidence level used: 0.95
How to read measures of statistical significance?
The standard error tells you about the uncertainty in the estimation. The true value of a parameter is expected to full within +/- 2 standard errors about 95% of the time.
z-value = Estimate / Std. Error and is used to compute the p-value.
The p-value ( Pr(>|z|) ) approximates the frequency you would see such a z-value if the true value of the parameter was exactly zero.
** How to visualise the results? **
drought_prediction <- as_tibble(emmeans(m_drought, ~ Genotype))
ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_boxplot() +
geom_point(data=drought_prediction, aes(y = emmean), color="red")+
geom_errorbar(data=drought_prediction, aes(x=Genotype, ymin = lower.CL, ymax=upper.CL),
color="red", inherit.aes = FALSE)
The error bars show 95% ranges of plausible true mean temperature values (in a larger population); they do NOT predict the spread of points.
Making it a bit prettier:
ggplot(drought_only, aes(x=Genotype, y=Temperature)) + geom_point() +
geom_point(data=drought_prediction, aes(y = emmean), color="red", size=3)+
geom_errorbar(data=drought_prediction, aes(x=Genotype, ymin = lower.CL, ymax=upper.CL),
color="red", inherit.aes = FALSE, width=0.1) + theme_bw()
Leaf temperature under different water conditions for two different genotypes. Let us now consider the full dataset.
drought_data <- read_csv("Data/droughtdata.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## plant = col_double(),
## Genotype = col_character(),
## WaterCondition = col_character(),
## Temperature = col_double()
## )
drought_data
Do genotypes differ in their response to drought?
ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
geom_boxplot()
m_additive <- glmmTMB(Temperature ~ Genotype + WaterCondition, data = drought_data)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_additive)
## Family: gaussian ( identity )
## Formula: Temperature ~ Genotype + WaterCondition
## Data: drought_data
##
## AIC BIC logLik deviance df.resid
## 153.5 159.3 -72.7 145.5 28
##
##
## Dispersion estimate for gaussian family (sigma^2): 5.52
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 33.4688 0.7191 46.55 < 2e-16 ***
## GenotypeWT 3.3375 0.8303 4.02 5.83e-05 ***
## WaterConditionNormal -3.1750 0.8303 -3.82 0.000131 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(m_additive, ~ WaterCondition + Genotype)
## WaterCondition Genotype emmean SE df lower.CL upper.CL
## Drought mutant 33.5 0.719 28 32.0 34.9
## Normal mutant 30.3 0.719 28 28.8 31.8
## Drought WT 36.8 0.719 28 35.3 38.3
## Normal WT 33.6 0.719 28 32.2 35.1
##
## Confidence level used: 0.95
This model tells is there is an overall difference among genotypes and an overall difference among water conditions, but that does not answer our question.
drought_data %>%
group_by(WaterCondition, Genotype) %>%
summarise(mean(Temperature))
## `summarise()` has grouped output by 'WaterCondition'. You can override using the `.groups` argument.
additive_prediction <- as_tibble(emmeans(m_additive, ~ WaterCondition + Genotype))
ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
geom_jitter(alpha=0.3, width=0.1, height = 0) +
geom_point(data=additive_prediction, aes(y=emmean), size=3) +
geom_errorbar(data=additive_prediction,
aes(ymin=lower.CL, ymax=upper.CL,x=WaterCondition, color=Genotype),
width=0.1, inherit.aes = FALSE)
m_interaction <- glmmTMB(Temperature ~ Genotype * WaterCondition, data = drought_data)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_interaction)
## Family: gaussian ( identity )
## Formula: Temperature ~ Genotype * WaterCondition
## Data: drought_data
##
## AIC BIC logLik deviance df.resid
## 128.0 135.4 -59.0 118.0 27
##
##
## Dispersion estimate for gaussian family (sigma^2): 2.34
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 31.6875 0.5411 58.56 < 2e-16 ***
## GenotypeWT 6.9000 0.7652 9.02 < 2e-16 ***
## WaterConditionNormal 0.3875 0.7652 0.51 0.613
## GenotypeWT:WaterConditionNormal -7.1250 1.0822 -6.58 4.58e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(m_interaction, ~ WaterCondition + Genotype)
## WaterCondition Genotype emmean SE df lower.CL upper.CL
## Drought mutant 31.7 0.541 27 30.6 32.8
## Normal mutant 32.1 0.541 27 31.0 33.2
## Drought WT 38.6 0.541 27 37.5 39.7
## Normal WT 31.9 0.541 27 30.7 33.0
##
## Confidence level used: 0.95
The parameter GenotypeWT:WaterConditionNormal
tests for an interaction.
You can recover means by hand:
#Mutant in drought
fixef(m_interaction)$cond["(Intercept)"]
## (Intercept)
## 31.6875
#WT in drought
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["GenotypeWT"]
## (Intercept)
## 38.5875
#Mutant in normal
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["WaterConditionNormal"]
## (Intercept)
## 32.075
#WT in normal
fixef(m_interaction)$cond["(Intercept)"] + fixef(m_interaction)$cond["GenotypeWT"] +
fixef(m_interaction)$cond["WaterConditionNormal"] + fixef(m_interaction)$cond["GenotypeWT:WaterConditionNormal"]
## (Intercept)
## 31.85
interaction_prediction <- as_tibble(emmeans(m_interaction, ~ WaterCondition + Genotype))
ggplot(drought_data, aes(x = WaterCondition, y= Temperature, color=Genotype)) +
geom_jitter(alpha=0.3, width=0.1, height =0) +
geom_point(data=interaction_prediction, aes(y=emmean), size=3) +
geom_errorbar(data=interaction_prediction,
aes(ymin=lower.CL, ymax=upper.CL,x=WaterCondition, color=Genotype),
width=0.1, inherit.aes = FALSE)
What are linear model assumptions? Let’s keep it simple for today. Let’s just say:
Residuals are randomly distributed, approximately following a normal distribution, with the same variance everywhere.
If the assumptions are not met we know that there may be issues with the estimation of parameters, and especially with the calculation of standard errors and p-values. The model answers are not necessarily wrong, but maybe we can do better, we need to be a bit careful and it would be good to understand where deviations from assumptions come from and whether that is a bad problem.
If the assumptions are met, we can relax a bit more, although there can always be fundamental problems that are difficult to detect.
Looking at our ggplot with raw data and predictions we can see:
simres <- simulateResiduals(m_interaction)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
plot(simres)
All looks good here.
DHARMa may look like an overkill here, but it works well for more complex models, so why not start using it now?
How far can kids jump depending on their mass?
jumping <- read_csv("Data/jumpingdistance.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## jump = col_double(),
## mass = col_double(),
## height = col_double()
## )
jumping
m_mass <- glmmTMB(jump ~ mass, data=jumping)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_mass)
## Family: gaussian ( identity )
## Formula: jump ~ mass
## Data: jumping
##
## AIC BIC logLik deviance df.resid
## 2998.1 3012.8 -1496.0 2992.1 997
##
##
## Dispersion estimate for gaussian family (sigma^2): 1.17
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.474112 0.237029 6.219 5e-10 ***
## mass 0.040010 0.004271 9.369 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m_height <- glmmTMB(jump ~ height, data=jumping)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_mass)
## Family: gaussian ( identity )
## Formula: jump ~ mass
## Data: jumping
##
## AIC BIC logLik deviance df.resid
## 2998.1 3012.8 -1496.0 2992.1 997
##
##
## Dispersion estimate for gaussian family (sigma^2): 1.17
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.474112 0.237029 6.219 5e-10 ***
## mass 0.040010 0.004271 9.369 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(jumping, aes(x=mass, y=jump)) + geom_point()
m_massheight <- glmmTMB(jump ~ mass + height, data=jumping)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m_massheight)
## Family: gaussian ( identity )
## Formula: jump ~ mass + height
## Data: jumping
##
## AIC BIC logLik deviance df.resid
## 2931.6 2951.2 -1461.8 2923.6 996
##
##
## Dispersion estimate for gaussian family (sigma^2): 1.09
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.446438 0.739727 -6.011 1.84e-09 ***
## mass -0.019594 0.008196 -2.391 0.0168 *
## height 5.566775 0.661342 8.417 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The direct (causal) effect of mass is negative, as revealed by a multiple regression.
Parameter estimates often change value or even sign when you include or remove other predictors.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x=jumping$mass, y=jumping$height, z=jumping$jump,
type="scatter3d", mode="markers",
color = jumping$jump, alpha = 0.8) %>%
layout(
title = "Layout options in a 3d scatter plot",
scene = list(
xaxis = list(title = "mass"),
yaxis = list(title = "height"),
zaxis = list(title = "jumping distance")
))
Snow vole (that is a rodent) data. What explains variation in body mass?
voles <- read_csv("Data/voles.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id = col_character(),
## mass = col_double(),
## body_length = col_double(),
## tail_length = col_double(),
## sex = col_character(),
## age = col_character(),
## year = col_double(),
## x_coord = col_double(),
## y_coord = col_double(),
## day = col_double(),
## inbreeding = col_double()
## )
Are adult voles heavier than juveniles?
ggplot(voles, aes(x=age, y=mass)) + geom_violin() + geom_jitter(width = 0.1, height = 0)
## Warning: Removed 114 rows containing non-finite values (stat_ydensity).
## Warning: Removed 114 rows containing missing values (geom_point).
m0 <- glmmTMB(mass ~ 1 + age, data=voles)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m0)
## Family: gaussian ( identity )
## Formula: mass ~ 1 + age
## Data: voles
##
## AIC BIC logLik deviance df.resid
## 17976.8 17994.7 -8985.4 17970.8 2906
##
##
## Dispersion estimate for gaussian family (sigma^2): 28.2
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 42.0208 0.1295 324.6 <2e-16 ***
## ageJ -15.9849 0.1994 -80.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, on average, 16g heavier.
But, are they heavier given their body length?
ggplot(voles, aes(x=body_length, y=mass, color=age)) + geom_point()
## Warning: Removed 786 rows containing missing values (geom_point).
m1 <- glmmTMB(mass ~ 1 + body_length + age, data=voles)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
summary(m1)
## Family: gaussian ( identity )
## Formula: mass ~ 1 + body_length + age
## Data: voles
##
## AIC BIC logLik deviance df.resid
## 12562.7 12585.6 -6277.4 12554.7 2233
##
##
## Dispersion estimate for gaussian family (sigma^2): 16
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -27.58046 1.62062 -17.02 <2e-16 ***
## body_length 0.59747 0.01389 43.03 <2e-16 ***
## ageJ -5.07780 0.30696 -16.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given their length, adult voles are just 5g heavier than juveniles. But juveniles are also shorter, so overall the mass difference appears larger if you do not account for body length.
Which approach is correct? Both, it depends what your question is!
By the way, age and body length are highly correlated:
cor(1-as.integer(as.factor(voles$age)),
voles$body_length, use = "pairwise.complete.obs")
## [1] 0.8265616
Contrary to what you may have heard, high correlation between predictors is not a problem in linear models. But it is important to know about the correlation structure of predictors to interpret the results correctly.
Evaluate the model
sim_m1 <- simulateResiduals(m1)
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
## Warning in Matrix::sparseMatrix(dims = c(0, 0), i = integer(0), j =
## integer(0), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
plot(sim_m1)
Reasons for the lack of fit?
pred_voles <- as_tibble(emmeans(m1, spec = ~ age + body_length,
at=list(body_length=seq(min(voles$body_length, na.rm = TRUE),
max(voles$body_length, na.rm = TRUE)))))
ggplot(voles, aes(x=body_length, y=mass, color=age)) + geom_point() +
geom_line(data = pred_voles, aes(y=emmean), size=2)
## Warning: Removed 786 rows containing missing values (geom_point).
Probably non-linear scaling of masss with body length, interaction age-body length, probably heterogeneity not accounted for.
For instance, if you have 2 predictors, a simple linear model can be written as:
Response = Intercept + Estimate1 × Predictor1+ Estimate2 × Predictor2 + Error, with Error distributed following a random normal distribution.
or more mathematically as:
\[ y_i = \mu + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i \text{, } \epsilon \sim N(0, V_R) \]
You can fit that model as:
glmmTMB(response ~ 1 + predictor1 + predictor2, data=data)
or equivalently as:
glmmTMB(response ~ predictor1 + predictor2, data=data)
or equivalently as:
glmmTMB(response ~ predictor2 + predictor1, data=data)
For simple models you can use lm()
instead of glmmTMB()
.
Syntax rules: