GLM 2: count data

Today:

Fit generalized linear models for count data
Predict and visualise model output
Understand issues of over-dispersion and how to address them

Packages:

library(ggplot2)
library(performance)
library(DHARMa)
library(glmmTMB)

## Registered S3 method overwritten by 'glmmTMB':
##   method        from  
##   refit.glmmTMB DHARMa

Data:

download.file("https://timotheenivalis.github.io/data/reproduction.csv", 
              destfile = "data/reproduction.csv")
download.file("https://timotheenivalis.github.io/data/voles.csv", 
              destfile = "data/voles.csv")

Count data properties

Ideal count data produced by “Poisson” process. The Poisson distribution has a single parameter: the expected count, lambda.

randomcount <- rpois(n = 10000, lambda = 3.2)

mean(randomcount)

## [1] 3.1782

var(randomcount)

## [1] 3.136758

range(randomcount)

## [1]  0 11

summary(randomcount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.178   4.000  11.000

randomcount <- rpois(n = 10000, lambda = 10.8)

mean(randomcount)

## [1] 10.7749

var(randomcount)

## [1] 10.84571

range(randomcount)

## [1]  1 24

summary(randomcount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   11.00   10.77   13.00   24.00

ggplot(data.frame(x=randomcount), aes(x=x))+
  geom_histogram(binwidth = 0.5,aes(y=stat(count) / sum(count)))

The mean of a true Poisson variable equals its variance. More in general in count variables that are not perfect Poisson, the variance is a function of the mean.

x <- seq(from = 1, to=10, by=1)

dat <- data.frame(x= rep(x, each=1000),
                  y=as.vector(sapply(x, function(x) {rpois(n = 1000, lambda = x)})))

ggplot(dat, aes(x=as.factor(x), y=y)) + geom_boxplot()

Fitting count data GLMs in R

The three components of a glm:

A linear model: response = intercept + predictor*slope …
A link function between the scale of the data (positive intergers) and the scale of the linear model (-inf to +inf). For count data generally log().
A random distribution linking expected data (on average 1.27) to actual data (0,1,2,3…). For count data, the Poisson distribution or other related distributions.

The simplest solution is:

glm(obs ~ 1 + x , family = "poisson", data=data)

However you should NEVER use this. glm( family=“poisson”) assumes there is no unexplained variation apart from the incompressible Poisson random process, that is, there is no unexplained variation in expected counts, or \(V(\mathrm{exp}(Y)) = E(\mathrm{exp}(Y))\). This is almost never the case with biological data. Most of the time there is lots of unexplained variance and are over-dispersed. On the other hand, some biological processes are “conservative”/under-dispersed and have \(V(\mathrm{exp}(y)) < E(\mathrm{exp}(y))\) (for instance bird clutch size within species).

When you use simple Poisson glm on over-dispersed data you artificially decrease SE and p-value: you are going to find lots of false positive. When you use simple Poisson glm on under-dispersed data you artificially increase SE and p-value: you loose statistical power.

There are at least three options to obtain proper inference/prediction with count data:

Use quasi-likelihood methods with approximate correction of uncertainty. That is available as glm(family=“quasipoisson”). Easy but not exact.
Use generalized Poisson families such as the negative-binomial.
If the data are assumed to be over-dispersed conditional on predictors, use an “observation-level random effect”; that’s equivalent to adding a residual to the Poisson model.

Practice

Load the data reproduction.csv

repro <- read.csv("data/reproduction.csv")
str(repro)

## 'data.frame':    300 obs. of  3 variables:
##  $ reproduction: int  0 0 3 1 0 3 1 10 0 3 ...
##  $ size        : num  1.59 4.55 6.17 2.69 4.86 ...
##  $ sex         : int  0 1 0 0 0 1 0 1 0 0 ...

summary(repro)

##   reproduction         size             sex        
##  Min.   :  0.00   Min.   : 1.027   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.: 2.735   1st Qu.:0.0000  
##  Median :  1.00   Median : 4.001   Median :1.0000  
##  Mean   :  3.62   Mean   : 4.139   Mean   :0.5033  
##  3rd Qu.:  3.00   3rd Qu.: 5.279   3rd Qu.:1.0000  
##  Max.   :169.00   Max.   :10.088   Max.   :1.0000

Plot reproduction data, calculate the mean and variance in reproduction.

mean(repro$reproduction)

## [1] 3.62

var(repro$reproduction)

## [1] 134.2565

ggplot(repro, aes(x=reproduction)) + geom_density()

The variance is larger than the mean; data are currently over-dispersed (although it is possible that data are not over-dispersed conditional on predictors.)

Overlay a Gaussian distribution of same mean and variance, does it fit?

normdens <- data.frame(x=seq(-10,max(repro$reproduction), by = 0.1), 
                       y=dnorm(x=seq(-10, max(repro$reproduction), by = 0.1),
                  mean = mean(repro$reproduction),
                  sd = sd(repro$reproduction)))

ggplot(repro, aes(x=reproduction)) + geom_histogram(aes(y=stat(count) / sum(count))) +
  geom_line(data=normdens, aes(x = x, y=y), inherit.aes = FALSE, color="red")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fit an compare a lm and a Poisson glm of reproduction on size

lmrepro <- lm(reproduction ~ size, data=repro)
summary(lmrepro)

## 
## Call:
## lm(formula = reproduction ~ size, data = repro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.314  -3.439  -1.120   1.013 158.288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.0484     1.5900  -2.546   0.0114 *  
## size          1.8526     0.3515   5.270 2.62e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.1 on 298 degrees of freedom
## Multiple R-squared:  0.08526,    Adjusted R-squared:  0.08219 
## F-statistic: 27.77 on 1 and 298 DF,  p-value: 2.62e-07

glmrepro <- glm(reproduction ~ size, data = repro, family = "quasipoisson")
summary(glmrepro)

## 
## Call:
## glm(formula = reproduction ~ size, family = "quasipoisson", data = repro)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5644  -1.8016  -1.1823   0.1392  23.4704  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.83080    0.31504  -2.637   0.0088 ** 
## size         0.42785    0.05075   8.431 1.49e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 12.3918)
## 
##     Null deviance: 2892.0  on 299  degrees of freedom
## Residual deviance: 2063.5  on 298  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

ggplot(repro, aes(x=size, y=reproduction)) + geom_point()+
  geom_smooth(method="lm",fullrange=TRUE, color="orange", fill="orange", alpha=0.2) +
  geom_smooth(method="glm", method.args=list(family="poisson"), color="red", fill="red", fullrange=TRUE, alpha=0.2) +
  geom_smooth(method="glm", method.args=list(family="quasipoisson"), color="blue", fill="blue", fullrange=TRUE, alpha=0.2) +
  ylim(c(-1,30)) + xlim(c(0,max(repro$size)))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

## Warning: Removed 3 rows containing missing values (geom_point).

Let’s build the prediction line by hand to understand the back-transformation.

sizex <- seq(from=0, to=10, by=0.1)
reprolatent <- coef(glmrepro)[1] + sizex*coef(glmrepro)[2]
#equivalent to:
predict(glmrepro, type = "link", newdata = data.frame(size=sizex))

##           1           2           3           4           5           6 
## -0.83080074 -0.78801586 -0.74523099 -0.70244612 -0.65966124 -0.61687637 
##           7           8           9          10          11          12 
## -0.57409150 -0.53130663 -0.48852175 -0.44573688 -0.40295201 -0.36016713 
##          13          14          15          16          17          18 
## -0.31738226 -0.27459739 -0.23181252 -0.18902764 -0.14624277 -0.10345790 
##          19          20          21          22          23          24 
## -0.06067302 -0.01788815  0.02489672  0.06768160  0.11046647  0.15325134 
##          25          26          27          28          29          30 
##  0.19603621  0.23882109  0.28160596  0.32439083  0.36717571  0.40996058 
##          31          32          33          34          35          36 
##  0.45274545  0.49553033  0.53831520  0.58110007  0.62388494  0.66666982 
##          37          38          39          40          41          42 
##  0.70945469  0.75223956  0.79502444  0.83780931  0.88059418  0.92337905 
##          43          44          45          46          47          48 
##  0.96616393  1.00894880  1.05173367  1.09451855  1.13730342  1.18008829 
##          49          50          51          52          53          54 
##  1.22287317  1.26565804  1.30844291  1.35122778  1.39401266  1.43679753 
##          55          56          57          58          59          60 
##  1.47958240  1.52236728  1.56515215  1.60793702  1.65072189  1.69350677 
##          61          62          63          64          65          66 
##  1.73629164  1.77907651  1.82186139  1.86464626  1.90743113  1.95021601 
##          67          68          69          70          71          72 
##  1.99300088  2.03578575  2.07857062  2.12135550  2.16414037  2.20692524 
##          73          74          75          76          77          78 
##  2.24971012  2.29249499  2.33527986  2.37806473  2.42084961  2.46363448 
##          79          80          81          82          83          84 
##  2.50641935  2.54920423  2.59198910  2.63477397  2.67755885  2.72034372 
##          85          86          87          88          89          90 
##  2.76312859  2.80591346  2.84869834  2.89148321  2.93426808  2.97705296 
##          91          92          93          94          95          96 
##  3.01983783  3.06262270  3.10540757  3.14819245  3.19097732  3.23376219 
##          97          98          99         100         101 
##  3.27654707  3.31933194  3.36211681  3.40490169  3.44768656

pred1 <- data.frame(sizex=sizex, reprolatent=reprolatent)

ggplot(repro, aes(x=size, y=reproduction)) + geom_point() + 
  geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")

pred1$reprodata <- exp(pred1$reprolatent)
#equivalent to:
predict(glmrepro, type = "response", newdata = data.frame(size=sizex))

##          1          2          3          4          5          6          7 
##  0.4357003  0.4547462  0.4746247  0.4953721  0.5170265  0.5396274  0.5632163 
##          8          9         10         11         12         13         14 
##  0.5878364  0.6135327  0.6403522  0.6683442  0.6975597  0.7280524  0.7598780 
##         15         16         17         18         19         20         21 
##  0.7930948  0.8277636  0.8639479  0.9017140  0.9411309  0.9822709  1.0252092 
##         22         23         24         25         26         27         28 
##  1.0700246  1.1167989  1.1656179  1.2165710  1.2697513  1.3252564  1.3831878 
##         29         30         31         32         33         34         35 
##  1.4436516  1.5067584  1.5726238  1.6413685  1.7131182  1.7880043  1.8661639 
##         36         37         38         39         40         41         42 
##  1.9477402  2.0328824  2.1217465  2.2144951  2.3112981  2.4123326  2.5177838 
##         43         44         45         46         47         48         49 
##  2.6278445  2.7427164  2.8626096  2.9877439  3.1183481  3.2546616  3.3969337 
##         50         51         52         53         54         55         56 
##  3.5454250  3.7004074  3.8621645  4.0309926  4.2072008  4.3911116  4.5830617 
##         57         58         59         60         61         62         63 
##  4.7834027  4.9925012  5.2107401  5.4385189  5.6762548  5.9243828  6.1833574 
##         64         65         66         67         68         69         70 
##  6.4536526  6.7357633  7.0302060  7.3375198  7.6582673  7.9930357  8.3424380 
##         71         72         73         74         75         76         77 
##  8.7071138  9.0877308  9.4849859  9.8996063 10.3323512 10.7840127 11.2554179 
##         78         79         80         81         82         83         84 
## 11.7474298 12.2609492 12.7969163 13.3563122 13.9401612 14.5495322 15.1855409 
##         85         86         87         88         89         90         91 
## 15.8493516 16.5421797 17.2652936 18.0200173 18.8077324 19.6298812 20.4879689 
##         92         93         94         95         96         97         98 
## 21.3835664 22.3183135 23.2939215 24.3121766 25.3749431 26.4841666 27.6418779 
##         99        100        101 
## 28.8501967 30.1113352 31.4276022

ggplot(repro, aes(x=size, y=reproduction)) + geom_point() + 
  geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")+
  geom_line(data = pred1, aes(x=sizex, y=reprodata), color="blue")

Before GLMs, researchers used to log-transform the data and fit linear models. What are the problems with this approach?

summary(lm(log(reproduction) ~ size, data=repro))

summary(lm(log(reproduction+1) ~ size, data=repro))

summary(lm(log(reproduction+0.1) ~ size, data=repro))

summary(lm(log(reproduction+0.001) ~ size, data=repro))

summary(lm(log(reproduction+0.000000000000000001) ~ size, data=repro))

#let's extract the slope parameter for smaller and smaller deviations from zero

sapply(X = 1:30, function(x){
  coef(lm(log(reproduction+10^(-x)) ~ size, data=repro))[2]
  }
  )

You cannot take the log of zero. The value of the slope is dependent on the arbitrary choice of a small deviation from zero.

glmmTMB

Better solution yet, use a specialized package like glmmTMB

glmrepro2 <- glmmTMB(reproduction ~ size, data = repro, family = "nbinom1")
summary(glmrepro2)

##  Family: nbinom1  ( log )
## Formula:          reproduction ~ size
## Data: repro
## 
##      AIC      BIC   logLik deviance df.resid 
##   1343.4   1354.5   -668.7   1337.4      297 
## 
## 
## Overdispersion parameter for nbinom1 family (): 7.11 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.29963    0.19273   1.555     0.12    
## size         0.21793    0.03485   6.254 4.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

newdat <- data.frame(size=seq(from=0, to=10, by=0.1))

predictobject <- predict(glmrepro2, type="link", se.fit = TRUE, newdata = newdat)
newdat$fit <- exp(predictobject$fit)
newdat$lowerCI <- exp(predictobject$fit - 1.96*predictobject$se.fit)
newdat$upperCI <- exp(predictobject$fit + 1.96*predictobject$se.fit)

ggplot(repro, aes(x=size, y=reproduction)) + geom_point()+
  geom_smooth(method="glm", method.args=list(family="quasipoisson"), color="blue", fill="blue", fullrange=TRUE, alpha=0.2) +
  geom_line(data=newdat, mapping = aes(x=size, y=fit))+
  geom_ribbon(inherit.aes = FALSE, data=newdat, aes(x=size, ymin=lowerCI, ymax=upperCI), alpha=0.2)+
  ylim(c(-1,30)) + xlim(c(0,max(repro$size)))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

## Warning: Removed 3 rows containing missing values (geom_point).

Check the diagnostic plots. Should you be worried?

plot(lmrepro)

check_model(lmrepro)

## Not enough model terms in the conditional part of the model to check for multicollinearity.

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 298 rows containing missing values (geom_text_repel).

plot(lmrepro)

check_model(glmrepro)

## Not enough model terms in the conditional part of the model to check for multicollinearity.

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 299 rows containing missing values (geom_text_repel).

#plot(glmrepro2)#this is not available
check_model(glmrepro2)

## Not enough model terms in the conditional part of the model to check for multicollinearity.

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

No worries. GLMs have different assumptions than lm(). Plots for glm generally look weird even if everything is good.

To assess the performance of the model you need to simulate residuals according to model estimates and then compare the distribution of some properties between the simulated and the real residuals. The process is automated in DHARMa (but is not availble for quasi-Poisson models but it does not matter, the residuals are the same for poisson and quasi-Poisson).

sim1 <- simulateResiduals(glm(reproduction ~ size, data = repro, family = "poisson"))
testResiduals(sim1)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.25334, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 5.734, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 2.5000e+01, outHigh = 2.1000e+01, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value < 2.2e-16
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.25334, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 5.734, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 2.5000e+01, outHigh = 2.1000e+01, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value < 2.2e-16
## alternative hypothesis: two.sided

sim2 <- simulateResiduals(glmrepro2)
testResiduals(sim2)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.091879, p-value = 0.01263
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0856, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 2.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.6716
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.091879, p-value = 0.01263
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0856, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 2.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.6716
## alternative hypothesis: two.sided

So the poisson fitted in glm() is really bad; it does not capture variation. The model fitted in glmmTMB is not ideal either, but much better.

The second family of negative binomial performs better:

glmrepro3 <- glmmTMB(reproduction ~ size, data = repro, family = "nbinom2")

sim3 <- simulateResiduals(glmrepro3)
testResiduals(sim3)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.06909, p-value = 0.1141
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.8154, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 3.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.06909, p-value = 0.1141
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.8154, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 3.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

Finally, in the case of over-dispersion (it does not work with under-dispersion!) we can model extra variance using an observation-level random effect:

repro$row <- 1:nrow(repro)
glmrepro4 <- glmmTMB(reproduction ~ sex +size + (1|row), 
                   data = repro, family = "poisson")

sim4 <- simulateResiduals(glmrepro4)
testResiduals(sim4)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.043371, p-value = 0.6251
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.5972, p-value = 0.128
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 1.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 1
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.043371, p-value = 0.6251
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.5972, p-value = 0.128
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 1.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 1
## alternative hypothesis: two.sided

Here the random effect model provides the best fit to the data and (probably) the correct amount of uncertainty for parameter estimates. However this works only for over-dispersion and is more difficult to work with (you need to think about including random effect variance in predictions). For other data the negative binomial models will perform better.

Complementary way to assess model: predictions

pred4 <- predict(glmrepro4, type = "response")
mean(pred4) ; mean(repro$reproduction)

## [1] 3.522955

## [1] 3.62

var(pred4) ; var(repro$reproduction)

## [1] 126.3489

## [1] 134.2565

hist(pred4); hist(repro$reproduction)

pred3 <- predict(glmrepro3, type = "response")
mean(pred3) ; mean(repro$reproduction)

## [1] 3.517754

## [1] 3.62

var(pred3) ; var(repro$reproduction)

## [1] 11.51497

## [1] 134.2565

hist(pred3); hist(repro$reproduction)

General appraoch:

Define a model formula corresponding to your scientific goal (test hypothesis, make prediction…)
Choose a potentially appropriate family (poisson + random effect, negative binomials…)
Visualise model predictions, compare mean, variance, histogram… does it fit?
Look at diagnostics from simulated residuals. Is there much unexplained variation left? Are there strong outliers?
From 3 and 4, can I think of potentially better model family? If yes, restart at 2.
Conclude

Practice Factorial model

glmreprosex <- glmmTMB(reproduction ~ sex + (1|row),
                       data = repro, family = "poisson")
summary(glmreprosex)

##  Family: poisson  ( log )
## Formula:          reproduction ~ sex + (1 | row)
## Data: repro
## 
##      AIC      BIC   logLik deviance df.resid 
##   1315.7   1326.8   -654.9   1309.7      297 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  row    (Intercept) 1.729    1.315   
## Number of obs: 300, groups:  row, 300
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.03565    0.14444   0.247   0.8050  
## sex          0.43713    0.18588   2.352   0.0187 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tapply(X = predict(glmreprosex, re.form = NULL, type = "response"),
       INDEX = repro$sex, mean)

##        0        1 
## 3.390593 3.660415

simrs1 <- simulateResiduals(glmreprosex)
testResiduals(simrs1)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.032701, p-value = 0.9054
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.6997, p-value = 0.072
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 2.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.032701, p-value = 0.9054
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 1.6997, p-value = 0.072
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 2.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

glmreprosex2 <- glmmTMB(reproduction ~ sex ,
                       data = repro, family = "nbinom1")
summary(glmreprosex)

##  Family: poisson  ( log )
## Formula:          reproduction ~ sex + (1 | row)
## Data: repro
## 
##      AIC      BIC   logLik deviance df.resid 
##   1315.7   1326.8   -654.9   1309.7      297 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  row    (Intercept) 1.729    1.315   
## Number of obs: 300, groups:  row, 300
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.03565    0.14444   0.247   0.8050  
## sex          0.43713    0.18588   2.352   0.0187 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tapply(X = predict(glmreprosex, type = "response"),
       INDEX = repro$sex, mean)

##        0        1 
## 3.390593 3.660415

simrs2 <- simulateResiduals(glmreprosex2)
testResiduals(simrs2)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.088339, p-value = 0.01852
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0528, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 0.0000e+00, outHigh = 4.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.06609
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.088339, p-value = 0.01852
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0528, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 0.0000e+00, outHigh = 4.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.06609
## alternative hypothesis: two.sided

glmreprosex3 <- glmmTMB(reproduction ~ sex ,
                       data = repro, family = "nbinom2")
summary(glmreprosex3)

##  Family: nbinom2  ( log )
## Formula:          reproduction ~ sex
## Data: repro
## 
##      AIC      BIC   logLik deviance df.resid 
##   1378.1   1389.2   -686.1   1372.1      297 
## 
## 
## Overdispersion parameter for nbinom2 family (): 0.439 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.24988    0.13122   9.525   <2e-16 ***
## sex          0.07143    0.18460   0.387    0.699    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tapply(X = predict(glmreprosex, type = "response"),
       INDEX = repro$sex, mean)

##        0        1 
## 3.390593 3.660415

simrs3 <- simulateResiduals(glmreprosex3)
testResiduals(simrs3)

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.084508, p-value = 0.02755
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0349, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

## $uniformity
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.084508, p-value = 0.02755
## alternative hypothesis: two-sided
## 
## 
## $dispersion
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 2.0349, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## 
## $outliers
## 
##  DHARMa outlier test based on exact binomial test
## 
## data:  simulationOutput
## outLow = 1.0000e+00, outHigh = 3.0000e+00, nobs = 3.0000e+02, freqH0 =
## 3.9841e-03, p-value = 0.2383
## alternative hypothesis: two.sided

Practice 2

voles <- read.csv("data/voles.csv")
summary(voles)

##         id            year          sex      age          mass      
##  CN07-096:   4   Min.   :2006   Female:686   A:530   Min.   :11.00  
##  CN07-113:   4   1st Qu.:2007   Male  :608   J:764   1st Qu.:23.00  
##  CN07-114:   4   Median :2009                        Median :29.00  
##  CN06-036:   3   Mean   :2010                        Mean   :30.97  
##  CN06-065:   3   3rd Qu.:2013                        3rd Qu.:40.00  
##  CN06-089:   3   Max.   :2015                        Max.   :62.00  
##  (Other) :1273                                       NA's   :26     
##     survival       reproduction       fitness           StMass       
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   :-1.9245  
##  1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:-0.7683  
##  Median :0.0000   Median : 0.000   Median : 0.000   Median :-0.1902  
##  Mean   :0.2141   Mean   : 1.083   Mean   : 1.512   Mean   : 0.0000  
##  3rd Qu.:0.0000   3rd Qu.: 1.000   3rd Qu.: 2.000   3rd Qu.: 0.8696  
##  Max.   :1.0000   Max.   :15.000   Max.   :15.000   Max.   : 2.9893  
##                                                     NA's   :26       
##       mother   
##  CN13-002: 20  
##  CN06-104: 17  
##  CN07-125: 17  
##  CN06-050: 15  
##  CN07-036: 15  
##  (Other) :999  
##  NA's    :211

Model reproduction