Today:

Packages:

library(ggplot2)
library(performance)
library(DHARMa)
library(glmmTMB)
## Registered S3 method overwritten by 'glmmTMB':
##   method        from  
##   refit.glmmTMB DHARMa

Data:

download.file("https://timotheenivalis.github.io/data/reproduction.csv", 
              destfile = "data/reproduction.csv")
download.file("https://timotheenivalis.github.io/data/voles.csv", 
              destfile = "data/voles.csv")

Count data properties

Ideal count data produced by “Poisson” process. The Poisson distribution has a single parameter: the expected count, lambda.

randomcount <- rpois(n = 10000, lambda = 3.2)

mean(randomcount)
## [1] 3.1782
var(randomcount)
## [1] 3.136758
range(randomcount)
## [1]  0 11
summary(randomcount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.178   4.000  11.000
randomcount <- rpois(n = 10000, lambda = 10.8)

mean(randomcount)
## [1] 10.7749
var(randomcount)
## [1] 10.84571
range(randomcount)
## [1]  1 24
summary(randomcount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   11.00   10.77   13.00   24.00
ggplot(data.frame(x=randomcount), aes(x=x))+
  geom_histogram(binwidth = 0.5,aes(y=stat(count) / sum(count)))

The mean of a true Poisson variable equals its variance. More in general in count variables that are not perfect Poisson, the variance is a function of the mean.

x <- seq(from = 1, to=10, by=1)

dat <- data.frame(x= rep(x, each=1000),
                  y=as.vector(sapply(x, function(x) {rpois(n = 1000, lambda = x)})))

ggplot(dat, aes(x=as.factor(x), y=y)) + geom_boxplot()

Fitting count data GLMs in R

The three components of a glm:

  1. A linear model: response = intercept + predictor*slope …
  2. A link function between the scale of the data (positive intergers) and the scale of the linear model (-inf to +inf). For count data generally log().
  3. A random distribution linking expected data (on average 1.27) to actual data (0,1,2,3…). For count data, the Poisson distribution or other related distributions.

The simplest solution is:

glm(obs ~ 1 + x , family = "poisson", data=data)

However you should NEVER use this. glm( family=“poisson”) assumes there is no unexplained variation apart from the incompressible Poisson random process, that is, there is no unexplained variation in expected counts, or \(V(\mathrm{exp}(Y)) = E(\mathrm{exp}(Y))\). This is almost never the case with biological data. Most of the time there is lots of unexplained variance and are over-dispersed. On the other hand, some biological processes are “conservative”/under-dispersed and have \(V(\mathrm{exp}(y)) < E(\mathrm{exp}(y))\) (for instance bird clutch size within species).

When you use simple Poisson glm on over-dispersed data you artificially decrease SE and p-value: you are going to find lots of false positive. When you use simple Poisson glm on under-dispersed data you artificially increase SE and p-value: you loose statistical power.

There are at least three options to obtain proper inference/prediction with count data:

Practice

  1. Load the data reproduction.csv
repro <- read.csv("data/reproduction.csv")
str(repro)
## 'data.frame':    300 obs. of  3 variables:
##  $ reproduction: int  0 0 3 1 0 3 1 10 0 3 ...
##  $ size        : num  1.59 4.55 6.17 2.69 4.86 ...
##  $ sex         : int  0 1 0 0 0 1 0 1 0 0 ...
summary(repro)
##   reproduction         size             sex        
##  Min.   :  0.00   Min.   : 1.027   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.: 2.735   1st Qu.:0.0000  
##  Median :  1.00   Median : 4.001   Median :1.0000  
##  Mean   :  3.62   Mean   : 4.139   Mean   :0.5033  
##  3rd Qu.:  3.00   3rd Qu.: 5.279   3rd Qu.:1.0000  
##  Max.   :169.00   Max.   :10.088   Max.   :1.0000
  1. Plot reproduction data, calculate the mean and variance in reproduction.
mean(repro$reproduction)
## [1] 3.62
var(repro$reproduction)
## [1] 134.2565
ggplot(repro, aes(x=reproduction)) + geom_density()

The variance is larger than the mean; data are currently over-dispersed (although it is possible that data are not over-dispersed conditional on predictors.)

  1. Overlay a Gaussian distribution of same mean and variance, does it fit?
normdens <- data.frame(x=seq(-10,max(repro$reproduction), by = 0.1), 
                       y=dnorm(x=seq(-10, max(repro$reproduction), by = 0.1),
                  mean = mean(repro$reproduction),
                  sd = sd(repro$reproduction)))

ggplot(repro, aes(x=reproduction)) + geom_histogram(aes(y=stat(count) / sum(count))) +
  geom_line(data=normdens, aes(x = x, y=y), inherit.aes = FALSE, color="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Fit an compare a lm and a Poisson glm of reproduction on size
lmrepro <- lm(reproduction ~ size, data=repro)
summary(lmrepro)
## 
## Call:
## lm(formula = reproduction ~ size, data = repro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.314  -3.439  -1.120   1.013 158.288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.0484     1.5900  -2.546   0.0114 *  
## size          1.8526     0.3515   5.270 2.62e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.1 on 298 degrees of freedom
## Multiple R-squared:  0.08526,    Adjusted R-squared:  0.08219 
## F-statistic: 27.77 on 1 and 298 DF,  p-value: 2.62e-07
glmrepro <- glm(reproduction ~ size, data = repro, family = "quasipoisson")
summary(glmrepro)
## 
## Call:
## glm(formula = reproduction ~ size, family = "quasipoisson", data = repro)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5644  -1.8016  -1.1823   0.1392  23.4704  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.83080    0.31504  -2.637   0.0088 ** 
## size         0.42785    0.05075   8.431 1.49e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 12.3918)
## 
##     Null deviance: 2892.0  on 299  degrees of freedom
## Residual deviance: 2063.5  on 298  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
ggplot(repro, aes(x=size, y=reproduction)) + geom_point()+
  geom_smooth(method="lm",fullrange=TRUE, color="orange", fill="orange", alpha=0.2) +
  geom_smooth(method="glm", method.args=list(family="poisson"), color="red", fill="red", fullrange=TRUE, alpha=0.2) +
  geom_smooth(method="glm", method.args=list(family="quasipoisson"), color="blue", fill="blue", fullrange=TRUE, alpha=0.2) +
  ylim(c(-1,30)) + xlim(c(0,max(repro$size)))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Let’s build the prediction line by hand to understand the back-transformation.

sizex <- seq(from=0, to=10, by=0.1)
reprolatent <- coef(glmrepro)[1] + sizex*coef(glmrepro)[2]
#equivalent to:
predict(glmrepro, type = "link", newdata = data.frame(size=sizex))
##           1           2           3           4           5           6 
## -0.83080074 -0.78801586 -0.74523099 -0.70244612 -0.65966124 -0.61687637 
##           7           8           9          10          11          12 
## -0.57409150 -0.53130663 -0.48852175 -0.44573688 -0.40295201 -0.36016713 
##          13          14          15          16          17          18 
## -0.31738226 -0.27459739 -0.23181252 -0.18902764 -0.14624277 -0.10345790 
##          19          20          21          22          23          24 
## -0.06067302 -0.01788815  0.02489672  0.06768160  0.11046647  0.15325134 
##          25          26          27          28          29          30 
##  0.19603621  0.23882109  0.28160596  0.32439083  0.36717571  0.40996058 
##          31          32          33          34          35          36 
##  0.45274545  0.49553033  0.53831520  0.58110007  0.62388494  0.66666982 
##          37          38          39          40          41          42 
##  0.70945469  0.75223956  0.79502444  0.83780931  0.88059418  0.92337905 
##          43          44          45          46          47          48 
##  0.96616393  1.00894880  1.05173367  1.09451855  1.13730342  1.18008829 
##          49          50          51          52          53          54 
##  1.22287317  1.26565804  1.30844291  1.35122778  1.39401266  1.43679753 
##          55          56          57          58          59          60 
##  1.47958240  1.52236728  1.56515215  1.60793702  1.65072189  1.69350677 
##          61          62          63          64          65          66 
##  1.73629164  1.77907651  1.82186139  1.86464626  1.90743113  1.95021601 
##          67          68          69          70          71          72 
##  1.99300088  2.03578575  2.07857062  2.12135550  2.16414037  2.20692524 
##          73          74          75          76          77          78 
##  2.24971012  2.29249499  2.33527986  2.37806473  2.42084961  2.46363448 
##          79          80          81          82          83          84 
##  2.50641935  2.54920423  2.59198910  2.63477397  2.67755885  2.72034372 
##          85          86          87          88          89          90 
##  2.76312859  2.80591346  2.84869834  2.89148321  2.93426808  2.97705296 
##          91          92          93          94          95          96 
##  3.01983783  3.06262270  3.10540757  3.14819245  3.19097732  3.23376219 
##          97          98          99         100         101 
##  3.27654707  3.31933194  3.36211681  3.40490169  3.44768656
pred1 <- data.frame(sizex=sizex, reprolatent=reprolatent)

ggplot(repro, aes(x=size, y=reproduction)) + geom_point() + 
  geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")

pred1$reprodata <- exp(pred1$reprolatent)
#equivalent to:
predict(glmrepro, type = "response", newdata = data.frame(size=sizex))
##          1          2          3          4          5          6          7 
##  0.4357003  0.4547462  0.4746247  0.4953721  0.5170265  0.5396274  0.5632163 
##          8          9         10         11         12         13         14 
##  0.5878364  0.6135327  0.6403522  0.6683442  0.6975597  0.7280524  0.7598780 
##         15         16         17         18         19         20         21 
##  0.7930948  0.8277636  0.8639479  0.9017140  0.9411309  0.9822709  1.0252092 
##         22         23         24         25         26         27         28 
##  1.0700246  1.1167989  1.1656179  1.2165710  1.2697513  1.3252564  1.3831878 
##         29         30         31         32         33         34         35 
##  1.4436516  1.5067584  1.5726238  1.6413685  1.7131182  1.7880043  1.8661639 
##         36         37         38         39         40         41         42 
##  1.9477402  2.0328824  2.1217465  2.2144951  2.3112981  2.4123326  2.5177838 
##         43         44         45         46         47         48         49 
##  2.6278445  2.7427164  2.8626096  2.9877439  3.1183481  3.2546616  3.3969337 
##         50         51         52         53         54         55         56 
##  3.5454250  3.7004074  3.8621645  4.0309926  4.2072008  4.3911116  4.5830617 
##         57         58         59         60         61         62         63 
##  4.7834027  4.9925012  5.2107401  5.4385189  5.6762548  5.9243828  6.1833574 
##         64         65         66         67         68         69         70 
##  6.4536526  6.7357633  7.0302060  7.3375198  7.6582673  7.9930357  8.3424380 
##         71         72         73         74         75         76         77 
##  8.7071138  9.0877308  9.4849859  9.8996063 10.3323512 10.7840127 11.2554179 
##         78         79         80         81         82         83         84 
## 11.7474298 12.2609492 12.7969163 13.3563122 13.9401612 14.5495322 15.1855409 
##         85         86         87         88         89         90         91 
## 15.8493516 16.5421797 17.2652936 18.0200173 18.8077324 19.6298812 20.4879689 
##         92         93         94         95         96         97         98 
## 21.3835664 22.3183135 23.2939215 24.3121766 25.3749431 26.4841666 27.6418779 
##         99        100        101 
## 28.8501967 30.1113352 31.4276022
ggplot(repro, aes(x=size, y=reproduction)) + geom_point() + 
  geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")+
  geom_line(data = pred1, aes(x=sizex, y=reprodata), color="blue")