Today:

• Fit generalized linear models for count data
• Predict and visualise model output
• Understand issues of over-dispersion and how to address them

Packages:

library(ggplot2)
library(performance)
library(DHARMa)
library(glmmTMB)
## Registered S3 method overwritten by 'glmmTMB':
##   method        from
##   refit.glmmTMB DHARMa

Data:

download.file("https://timotheenivalis.github.io/data/reproduction.csv",
destfile = "data/reproduction.csv")
destfile = "data/voles.csv")

## Count data properties

Ideal count data produced by “Poisson” process. The Poisson distribution has a single parameter: the expected count, lambda.

randomcount <- rpois(n = 10000, lambda = 3.2)

mean(randomcount)
## [1] 3.1782
var(randomcount)
## [1] 3.136758
range(randomcount)
## [1]  0 11
summary(randomcount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   2.000   3.000   3.178   4.000  11.000
randomcount <- rpois(n = 10000, lambda = 10.8)

mean(randomcount)
## [1] 10.7749
var(randomcount)
## [1] 10.84571
range(randomcount)
## [1]  1 24
summary(randomcount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    8.00   11.00   10.77   13.00   24.00
ggplot(data.frame(x=randomcount), aes(x=x))+
geom_histogram(binwidth = 0.5,aes(y=stat(count) / sum(count)))

The mean of a true Poisson variable equals its variance. More in general in count variables that are not perfect Poisson, the variance is a function of the mean.

x <- seq(from = 1, to=10, by=1)

dat <- data.frame(x= rep(x, each=1000),
y=as.vector(sapply(x, function(x) {rpois(n = 1000, lambda = x)})))

ggplot(dat, aes(x=as.factor(x), y=y)) + geom_boxplot()

## Fitting count data GLMs in R

The three components of a glm:

1. A linear model: response = intercept + predictor*slope …
2. A link function between the scale of the data (positive intergers) and the scale of the linear model (-inf to +inf). For count data generally log().
3. A random distribution linking expected data (on average 1.27) to actual data (0,1,2,3…). For count data, the Poisson distribution or other related distributions.

The simplest solution is:

glm(obs ~ 1 + x , family = "poisson", data=data)

However you should NEVER use this. glm( family=“poisson”) assumes there is no unexplained variation apart from the incompressible Poisson random process, that is, there is no unexplained variation in expected counts, or $$V(\mathrm{exp}(Y)) = E(\mathrm{exp}(Y))$$. This is almost never the case with biological data. Most of the time there is lots of unexplained variance and are over-dispersed. On the other hand, some biological processes are “conservative”/under-dispersed and have $$V(\mathrm{exp}(y)) < E(\mathrm{exp}(y))$$ (for instance bird clutch size within species).

When you use simple Poisson glm on over-dispersed data you artificially decrease SE and p-value: you are going to find lots of false positive. When you use simple Poisson glm on under-dispersed data you artificially increase SE and p-value: you loose statistical power.

There are at least three options to obtain proper inference/prediction with count data:

• Use quasi-likelihood methods with approximate correction of uncertainty. That is available as glm(family=“quasipoisson”). Easy but not exact.
• Use generalized Poisson families such as the negative-binomial.
• If the data are assumed to be over-dispersed conditional on predictors, use an “observation-level random effect”; that’s equivalent to adding a residual to the Poisson model.

## Practice

1. Load the data reproduction.csv
repro <- read.csv("data/reproduction.csv")
str(repro)
## 'data.frame':    300 obs. of  3 variables:
##  $reproduction: int 0 0 3 1 0 3 1 10 0 3 ... ##$ size        : num  1.59 4.55 6.17 2.69 4.86 ...
##  $sex : int 0 1 0 0 0 1 0 1 0 0 ... summary(repro) ## reproduction size sex ## Min. : 0.00 Min. : 1.027 Min. :0.0000 ## 1st Qu.: 0.00 1st Qu.: 2.735 1st Qu.:0.0000 ## Median : 1.00 Median : 4.001 Median :1.0000 ## Mean : 3.62 Mean : 4.139 Mean :0.5033 ## 3rd Qu.: 3.00 3rd Qu.: 5.279 3rd Qu.:1.0000 ## Max. :169.00 Max. :10.088 Max. :1.0000 1. Plot reproduction data, calculate the mean and variance in reproduction. mean(repro$reproduction)
## [1] 3.62
var(repro$reproduction) ## [1] 134.2565 ggplot(repro, aes(x=reproduction)) + geom_density() The variance is larger than the mean; data are currently over-dispersed (although it is possible that data are not over-dispersed conditional on predictors.) 1. Overlay a Gaussian distribution of same mean and variance, does it fit? normdens <- data.frame(x=seq(-10,max(repro$reproduction), by = 0.1),
y=dnorm(x=seq(-10, max(repro$reproduction), by = 0.1), mean = mean(repro$reproduction),
sd = sd(repro$reproduction))) ggplot(repro, aes(x=reproduction)) + geom_histogram(aes(y=stat(count) / sum(count))) + geom_line(data=normdens, aes(x = x, y=y), inherit.aes = FALSE, color="red") ## stat_bin() using bins = 30. Pick better value with binwidth. 1. Fit an compare a lm and a Poisson glm of reproduction on size lmrepro <- lm(reproduction ~ size, data=repro) summary(lmrepro) ## ## Call: ## lm(formula = reproduction ~ size, data = repro) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.314 -3.439 -1.120 1.013 158.288 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4.0484 1.5900 -2.546 0.0114 * ## size 1.8526 0.3515 5.270 2.62e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.1 on 298 degrees of freedom ## Multiple R-squared: 0.08526, Adjusted R-squared: 0.08219 ## F-statistic: 27.77 on 1 and 298 DF, p-value: 2.62e-07 glmrepro <- glm(reproduction ~ size, data = repro, family = "quasipoisson") summary(glmrepro) ## ## Call: ## glm(formula = reproduction ~ size, family = "quasipoisson", data = repro) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -5.5644 -1.8016 -1.1823 0.1392 23.4704 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.83080 0.31504 -2.637 0.0088 ** ## size 0.42785 0.05075 8.431 1.49e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for quasipoisson family taken to be 12.3918) ## ## Null deviance: 2892.0 on 299 degrees of freedom ## Residual deviance: 2063.5 on 298 degrees of freedom ## AIC: NA ## ## Number of Fisher Scoring iterations: 6 ggplot(repro, aes(x=size, y=reproduction)) + geom_point()+ geom_smooth(method="lm",fullrange=TRUE, color="orange", fill="orange", alpha=0.2) + geom_smooth(method="glm", method.args=list(family="poisson"), color="red", fill="red", fullrange=TRUE, alpha=0.2) + geom_smooth(method="glm", method.args=list(family="quasipoisson"), color="blue", fill="blue", fullrange=TRUE, alpha=0.2) + ylim(c(-1,30)) + xlim(c(0,max(repro$size)))
## geom_smooth() using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## geom_smooth() using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## geom_smooth() using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Let’s build the prediction line by hand to understand the back-transformation.

sizex <- seq(from=0, to=10, by=0.1)
reprolatent <- coef(glmrepro)[1] + sizex*coef(glmrepro)[2]
#equivalent to:
predict(glmrepro, type = "link", newdata = data.frame(size=sizex))
##           1           2           3           4           5           6
## -0.83080074 -0.78801586 -0.74523099 -0.70244612 -0.65966124 -0.61687637
##           7           8           9          10          11          12
## -0.57409150 -0.53130663 -0.48852175 -0.44573688 -0.40295201 -0.36016713
##          13          14          15          16          17          18
## -0.31738226 -0.27459739 -0.23181252 -0.18902764 -0.14624277 -0.10345790
##          19          20          21          22          23          24
## -0.06067302 -0.01788815  0.02489672  0.06768160  0.11046647  0.15325134
##          25          26          27          28          29          30
##  0.19603621  0.23882109  0.28160596  0.32439083  0.36717571  0.40996058
##          31          32          33          34          35          36
##  0.45274545  0.49553033  0.53831520  0.58110007  0.62388494  0.66666982
##          37          38          39          40          41          42
##  0.70945469  0.75223956  0.79502444  0.83780931  0.88059418  0.92337905
##          43          44          45          46          47          48
##  0.96616393  1.00894880  1.05173367  1.09451855  1.13730342  1.18008829
##          49          50          51          52          53          54
##  1.22287317  1.26565804  1.30844291  1.35122778  1.39401266  1.43679753
##          55          56          57          58          59          60
##  1.47958240  1.52236728  1.56515215  1.60793702  1.65072189  1.69350677
##          61          62          63          64          65          66
##  1.73629164  1.77907651  1.82186139  1.86464626  1.90743113  1.95021601
##          67          68          69          70          71          72
##  1.99300088  2.03578575  2.07857062  2.12135550  2.16414037  2.20692524
##          73          74          75          76          77          78
##  2.24971012  2.29249499  2.33527986  2.37806473  2.42084961  2.46363448
##          79          80          81          82          83          84
##  2.50641935  2.54920423  2.59198910  2.63477397  2.67755885  2.72034372
##          85          86          87          88          89          90
##  2.76312859  2.80591346  2.84869834  2.89148321  2.93426808  2.97705296
##          91          92          93          94          95          96
##  3.01983783  3.06262270  3.10540757  3.14819245  3.19097732  3.23376219
##          97          98          99         100         101
##  3.27654707  3.31933194  3.36211681  3.40490169  3.44768656
pred1 <- data.frame(sizex=sizex, reprolatent=reprolatent)

ggplot(repro, aes(x=size, y=reproduction)) + geom_point() +
geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")

pred1$reprodata <- exp(pred1$reprolatent)
#equivalent to:
predict(glmrepro, type = "response", newdata = data.frame(size=sizex))
##          1          2          3          4          5          6          7
##  0.4357003  0.4547462  0.4746247  0.4953721  0.5170265  0.5396274  0.5632163
##          8          9         10         11         12         13         14
##  0.5878364  0.6135327  0.6403522  0.6683442  0.6975597  0.7280524  0.7598780
##         15         16         17         18         19         20         21
##  0.7930948  0.8277636  0.8639479  0.9017140  0.9411309  0.9822709  1.0252092
##         22         23         24         25         26         27         28
##  1.0700246  1.1167989  1.1656179  1.2165710  1.2697513  1.3252564  1.3831878
##         29         30         31         32         33         34         35
##  1.4436516  1.5067584  1.5726238  1.6413685  1.7131182  1.7880043  1.8661639
##         36         37         38         39         40         41         42
##  1.9477402  2.0328824  2.1217465  2.2144951  2.3112981  2.4123326  2.5177838
##         43         44         45         46         47         48         49
##  2.6278445  2.7427164  2.8626096  2.9877439  3.1183481  3.2546616  3.3969337
##         50         51         52         53         54         55         56
##  3.5454250  3.7004074  3.8621645  4.0309926  4.2072008  4.3911116  4.5830617
##         57         58         59         60         61         62         63
##  4.7834027  4.9925012  5.2107401  5.4385189  5.6762548  5.9243828  6.1833574
##         64         65         66         67         68         69         70
##  6.4536526  6.7357633  7.0302060  7.3375198  7.6582673  7.9930357  8.3424380
##         71         72         73         74         75         76         77
##  8.7071138  9.0877308  9.4849859  9.8996063 10.3323512 10.7840127 11.2554179
##         78         79         80         81         82         83         84
## 11.7474298 12.2609492 12.7969163 13.3563122 13.9401612 14.5495322 15.1855409
##         85         86         87         88         89         90         91
## 15.8493516 16.5421797 17.2652936 18.0200173 18.8077324 19.6298812 20.4879689
##         92         93         94         95         96         97         98
## 21.3835664 22.3183135 23.2939215 24.3121766 25.3749431 26.4841666 27.6418779
##         99        100        101
## 28.8501967 30.1113352 31.4276022
ggplot(repro, aes(x=size, y=reproduction)) + geom_point() +
geom_line(data = pred1, aes(x=sizex, y=reprolatent), color="red")+
geom_line(data = pred1, aes(x=sizex, y=reprodata), color="blue")