This section shows how to estimate the statistical power of a t-test by simulating data. We start with a fixed sample size and a fixed difference between the two samples.
set.seed(123) # random seed to make code reproducible
samplesize <- 40
sampdiff <- 0.5
samp1 <- rnorm(n = samplesize, mean = 0, sd = 1) # random numbers from a normal
# distribution with mean 0 and standard deviation 1
samp2 <- rnorm(n = samplesize, mean = sampdiff, sd = 1)
Our two simulated samples are samp1 and samp2. We “collected” 40 data points in each sample, and the real difference between them is 0.5. You could imagine that the things we are sampling are birds in two populations, and that the trait we are measuring is some standardized measure of behaviour (an aggression index). Birds are more aggressive in the second population (by a difference of 0.5). We haven’t measured every individual in the two populations, only a small fraction. The question now is: have we measured enough individuals to detect the difference between the two populations?
We perform a t-test comparing the two samples. To stay very basic, we will just look at the p-value from the test:
testres <- t.test(samp1, samp2)
testres$p.value
## [1] 0.0341346
That is “significant”: we can distinguish the difference between the samples from sampling noise.
Statistical power is the probability of observing a significant result under the present biological and experimental conditions (variance, difference, and sample size). To estimate statistical power, we can repeat the simulate-and-test process many times, and use the frequency of significant results as an estimate of this probability.
nbsimul <- 1000
set.seed(123) # random seed to make code reproducible
samplesize <- 40
sampdiff <- 0.5
storepvalues <- vector(length = nbsimul)
for (i in 1:nbsimul)
{
  samp1 <- rnorm(n = samplesize, mean = 0, sd = 1)
  samp2 <- rnorm(n = samplesize, mean = sampdiff, sd = 1)
  testres <- t.test(samp1, samp2)
  storepvalues[i] <- testres$p.value # store the p-value from this simulated data set
}
Statistical power is estimated as the proportion of significant results, that is, p-values below our significance threshold (typically 5%), which we can compute as:
mean(storepvalues<0.05)
## [1] 0.598
So the power is estimated to be 59.8% for a sample size of 40 per group, a difference of 0.5 between the samples, and a default t-test.
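As a quick sanity check (not part of the exercise), base R’s power.t.test() gives the analytical power for the same setting; it should come out close to our simulation-based estimate:
# analytical power for a two-sample t-test, n = 40 per group,
# true difference 0.5, sd 1 (should be close to the simulated 0.598)
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = 0.05)$power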
How many individuals should we collect to obtain a power of at least 99%?
Hint: there is an analytical solution in this simple case, but I would prefer you to use one or two for loops (a while loop could be useful too!).
Also, trial and error is an acceptable solution.
A very crude solution is to loop over sample sizes, one by one, until we reach a power of 0.99. A while loop is the right tool when you want to stop a loop based on a condition. We will also record all the realized powers so that we can plot power as a function of sample size afterwards.
nbsimul <- 1000
set.seed(123) # random seed to make code reproducible
samplesize <- 40
pcont <- vector()
j <- 1
power <- 0.596 # starting value: power previously estimated for a sample size of 40
pcont[j] <- power # store it so that pcont starts at sample size 40
sampdiff <- 0.5
while (power < 0.99)
{
  print(paste0("Sample size ", samplesize, " is not sufficient, power is only ", power))
  samplesize <- samplesize + 1
  storepvalues <- vector(length = nbsimul)
  for (i in 1:nbsimul)
  {
    samp1 <- rnorm(n = samplesize, mean = 0, sd = 1)
    samp2 <- rnorm(n = samplesize, mean = sampdiff, sd = 1)
    testres <- t.test(samp1, samp2)
    storepvalues[i] <- testres$p.value # store the p-value from this simulated data set
  }
  power <- mean(storepvalues < 0.05)
  j <- j + 1
  pcont[j] <- power
}
## [1] "Sample size 40 is not sufficient, power is only 0.596"
## [1] "Sample size 41 is not sufficient, power is only 0.615"
## [1] "Sample size 42 is not sufficient, power is only 0.605"
## [1] "Sample size 43 is not sufficient, power is only 0.64"
## [1] "Sample size 44 is not sufficient, power is only 0.627"
## [1] "Sample size 45 is not sufficient, power is only 0.656"
## [1] "Sample size 46 is not sufficient, power is only 0.662"
## [1] "Sample size 47 is not sufficient, power is only 0.671"
## [1] "Sample size 48 is not sufficient, power is only 0.656"
## [1] "Sample size 49 is not sufficient, power is only 0.698"
## [1] "Sample size 50 is not sufficient, power is only 0.709"
## [1] "Sample size 51 is not sufficient, power is only 0.721"
## [1] "Sample size 52 is not sufficient, power is only 0.689"
## [1] "Sample size 53 is not sufficient, power is only 0.726"
## [1] "Sample size 54 is not sufficient, power is only 0.703"
## [1] "Sample size 55 is not sufficient, power is only 0.735"
## [1] "Sample size 56 is not sufficient, power is only 0.722"
## [1] "Sample size 57 is not sufficient, power is only 0.754"
## [1] "Sample size 58 is not sufficient, power is only 0.773"
## [1] "Sample size 59 is not sufficient, power is only 0.78"
## [1] "Sample size 60 is not sufficient, power is only 0.779"
## [1] "Sample size 61 is not sufficient, power is only 0.78"
## [1] "Sample size 62 is not sufficient, power is only 0.795"
## [1] "Sample size 63 is not sufficient, power is only 0.784"
## [1] "Sample size 64 is not sufficient, power is only 0.789"
## [1] "Sample size 65 is not sufficient, power is only 0.823"
## [1] "Sample size 66 is not sufficient, power is only 0.832"
## [1] "Sample size 67 is not sufficient, power is only 0.839"
## [1] "Sample size 68 is not sufficient, power is only 0.821"
## [1] "Sample size 69 is not sufficient, power is only 0.82"
## [1] "Sample size 70 is not sufficient, power is only 0.85"
## [1] "Sample size 71 is not sufficient, power is only 0.85"
## [1] "Sample size 72 is not sufficient, power is only 0.838"
## [1] "Sample size 73 is not sufficient, power is only 0.842"
## [1] "Sample size 74 is not sufficient, power is only 0.832"
## [1] "Sample size 75 is not sufficient, power is only 0.872"
## [1] "Sample size 76 is not sufficient, power is only 0.859"
## [1] "Sample size 77 is not sufficient, power is only 0.852"
## [1] "Sample size 78 is not sufficient, power is only 0.848"
## [1] "Sample size 79 is not sufficient, power is only 0.885"
## [1] "Sample size 80 is not sufficient, power is only 0.896"
## [1] "Sample size 81 is not sufficient, power is only 0.872"
## [1] "Sample size 82 is not sufficient, power is only 0.88"
## [1] "Sample size 83 is not sufficient, power is only 0.9"
## [1] "Sample size 84 is not sufficient, power is only 0.874"
## [1] "Sample size 85 is not sufficient, power is only 0.906"
## [1] "Sample size 86 is not sufficient, power is only 0.892"
## [1] "Sample size 87 is not sufficient, power is only 0.918"
## [1] "Sample size 88 is not sufficient, power is only 0.916"
## [1] "Sample size 89 is not sufficient, power is only 0.894"
## [1] "Sample size 90 is not sufficient, power is only 0.906"
## [1] "Sample size 91 is not sufficient, power is only 0.908"
## [1] "Sample size 92 is not sufficient, power is only 0.913"
## [1] "Sample size 93 is not sufficient, power is only 0.93"
## [1] "Sample size 94 is not sufficient, power is only 0.925"
## [1] "Sample size 95 is not sufficient, power is only 0.929"
## [1] "Sample size 96 is not sufficient, power is only 0.941"
## [1] "Sample size 97 is not sufficient, power is only 0.925"
## [1] "Sample size 98 is not sufficient, power is only 0.92"
## [1] "Sample size 99 is not sufficient, power is only 0.936"
## [1] "Sample size 100 is not sufficient, power is only 0.951"
## [1] "Sample size 101 is not sufficient, power is only 0.934"
## [1] "Sample size 102 is not sufficient, power is only 0.96"
## [1] "Sample size 103 is not sufficient, power is only 0.944"
## [1] "Sample size 104 is not sufficient, power is only 0.954"
## [1] "Sample size 105 is not sufficient, power is only 0.95"
## [1] "Sample size 106 is not sufficient, power is only 0.959"
## [1] "Sample size 107 is not sufficient, power is only 0.943"
## [1] "Sample size 108 is not sufficient, power is only 0.951"
## [1] "Sample size 109 is not sufficient, power is only 0.966"
## [1] "Sample size 110 is not sufficient, power is only 0.955"
## [1] "Sample size 111 is not sufficient, power is only 0.958"
## [1] "Sample size 112 is not sufficient, power is only 0.964"
## [1] "Sample size 113 is not sufficient, power is only 0.966"
## [1] "Sample size 114 is not sufficient, power is only 0.972"
## [1] "Sample size 115 is not sufficient, power is only 0.964"
## [1] "Sample size 116 is not sufficient, power is only 0.966"
## [1] "Sample size 117 is not sufficient, power is only 0.977"
## [1] "Sample size 118 is not sufficient, power is only 0.977"
## [1] "Sample size 119 is not sufficient, power is only 0.972"
## [1] "Sample size 120 is not sufficient, power is only 0.969"
## [1] "Sample size 121 is not sufficient, power is only 0.975"
## [1] "Sample size 122 is not sufficient, power is only 0.979"
## [1] "Sample size 123 is not sufficient, power is only 0.975"
## [1] "Sample size 124 is not sufficient, power is only 0.985"
## [1] "Sample size 125 is not sufficient, power is only 0.984"
## [1] "Sample size 126 is not sufficient, power is only 0.976"
## [1] "Sample size 127 is not sufficient, power is only 0.974"
## [1] "Sample size 128 is not sufficient, power is only 0.982"
## [1] "Sample size 129 is not sufficient, power is only 0.976"
## [1] "Sample size 130 is not sufficient, power is only 0.979"
## [1] "Sample size 131 is not sufficient, power is only 0.98"
## [1] "Sample size 132 is not sufficient, power is only 0.98"
## [1] "Sample size 133 is not sufficient, power is only 0.979"
## [1] "Sample size 134 is not sufficient, power is only 0.977"
## [1] "Sample size 135 is not sufficient, power is only 0.979"
## [1] "Sample size 136 is not sufficient, power is only 0.987"
## [1] "Sample size 137 is not sufficient, power is only 0.979"
## [1] "Sample size 138 is not sufficient, power is only 0.981"
## [1] "Sample size 139 is not sufficient, power is only 0.986"
## [1] "Sample size 140 is not sufficient, power is only 0.983"
## [1] "Sample size 141 is not sufficient, power is only 0.986"
## [1] "Sample size 142 is not sufficient, power is only 0.986"
samplesize
## [1] 143
plot(pcont, x=40:samplesize)
With 143 individuals per group we are almost certain to detect an effect of this magnitude. Note that this number is itself only an estimate: the while loop stops the first time the noisy Monte Carlo estimate of power happens to cross 0.99.
In practice, we could get almost as much information while testing far fewer sample sizes. Let’s just play it safe and take a broad range of values:
valuestotest <- c(10, 50, 100, 150, 200, 250, 300, 350, 400)
pcont2 <- vector(length = length(valuestotest))
power <- 0
for (j in 1:length(valuestotest))
{
  samplesize <- valuestotest[j]
  storepvalues <- vector(length = nbsimul)
  for (i in 1:nbsimul)
  {
    samp1 <- rnorm(n = samplesize, mean = 0, sd = 1)
    samp2 <- rnorm(n = samplesize, mean = sampdiff, sd = 1)
    testres <- t.test(samp1, samp2)
    storepvalues[i] <- testres$p.value # store the p-value from this simulated data set
  }
  power <- mean(storepvalues < 0.05)
  pcont2[j] <- power
}
plot(pcont2, x=valuestotest, type="b")
abline(h=0.99, col="red")
The pwr package computes the analytical power for the same settings, so we can compare it with our simulations over the same range of sample sizes:
library(pwr)
valuestotest <- c(10, 50, 100, 150, 200, 250, 300, 350, 400)
pcont2 <- vector(length = length(valuestotest))
for (j in 1:length(valuestotest))
{
  pcont2[j] <- pwr.t.test(n = valuestotest[j], d = 0.5)$power
}
plot(pcont2, x=valuestotest, type="b")
abline(h=0.99, col="red")
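Since pwr is loaded, we can also invert the calculation and ask directly for the sample size that gives 99% power, which serves as a check on the simulation-based answer above:
# analytical sample size per group for 99% power at d = 0.5
# (compare with the 143 found by simulation; the analytical value is slightly larger,
# because the while loop stops at the first noisy crossing of the threshold)
pwr.t.test(d = 0.5, power = 0.99, sig.level = 0.05)$n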
So far we have fixed the effect size at 0.5. Let's now explore how power depends jointly on sample size and effect size. We will calculate power for a set of sample sizes:
samplesizestotest <- seq(from = 10, to = 200, by = 20)
crossed with a set of effect sizes:
effectsizestotest <- seq(from=0, to = 1, by = 0.1)
We create a data frame that contains all combinations of these two sets using the function expand.grid:
container <- expand.grid(samplesize= samplesizestotest, effectsize=effectsizestotest)
We add a column that will contain power estimates:
container$power <- NA
head(container)
## samplesize effectsize power
## 1 10 0 NA
## 2 30 0 NA
## 3 50 0 NA
## 4 70 0 NA
## 5 90 0 NA
## 6 110 0 NA
Now we calculate power for every line of the container using a for loop (we use “potatoes” as the index just to demonstrate that the loop index does not always have to be called “i”):
library(pwr)
for (potatoes in 1:nrow(container))
{
  container$power[potatoes] <- pwr.t.test(n = container$samplesize[potatoes],
                                          d = container$effectsize[potatoes])$power
}
We can visualize the results with:
plot(x = container$samplesize[container$effectsize == effectsizestotest[1]],
     y = container$power[container$effectsize == effectsizestotest[1]],
     ylim = c(0, 1), type = "l", xlab = "Sample size", ylab = "Power")
effectsizestotest[1]
## [1] 0
for (radish in 2:length(effectsizestotest))
{
  lines(x = container$samplesize[container$effectsize == effectsizestotest[radish]],
        y = container$power[container$effectsize == effectsizestotest[radish]],
        type = "l",
        col = rgb(red = effectsizestotest[radish], green = 0, blue = 0,
                  maxColorValue = max(effectsizestotest)))
}
legend(x = "bottomright", legend = rev(effectsizestotest),
       col = rev(rgb(red = effectsizestotest, green = 0, blue = 0,
                     maxColorValue = max(effectsizestotest))),
       lty = 1, title = "Effect size")
Or, using ggplot, and also considering several values of unexplained variation (thanks Kevin!):
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
power.t = function (diff=1, u=0, sd=1, n=100, iters=10000) {
  # simulate `iters` data sets of n observations per group and return
  # the proportion of significant t-tests
  data.frame(i=1:iters) %>%
    mutate(pval=map_dbl(i, function (x, diff, u, sd, n) {
      a = rnorm(n, u, sd)
      b = rnorm(n, u+diff, sd)
      t = t.test(a, b)
      t$p.value
    }, diff, u, sd, n)) %>%
    summarise(power=mean(pval < 0.05)) %>%
    pull(power)
}
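For example, the simulation from the beginning of this section can be reproduced with a single call (an illustration; with 1000 iterations the estimate should again be around 0.6):
# estimated power for n = 40 per group and a true difference of 0.5
power.t(diff = 0.5, n = 40, iters = 1000)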
powerVnVdiff = expand.grid(n=c(3, 10, 20, 50, 100, 200, 500, 1000),
diff=seq(0, 1, 0.1), sd=c(0.5, 1:3)) %>%
mutate(power=pmap_dbl(list(diff=diff, n=n, sd=sd), power.t, iters=1000))
ggplot(powerVnVdiff, aes(x=n, y=power)) +
geom_line(aes(colour=as.factor(diff))) +
scale_colour_discrete(name="Effect Size") +
facet_wrap(~paste0("SD = ", sd), ncol=1) +
scale_x_log10() +
theme_bw()
library(pwr)
pwrVn_eff = expand.grid(n=seq(10, 200, 20), eff=seq(0,1,0.1)) %>%
mutate(power=pmap(list(n=n, d=eff), pwr.t.test),
power=map_dbl(power, "power"))
head(pwrVn_eff)
## n eff power
## 1 10 0 0.05
## 2 30 0 0.05
## 3 50 0 0.05
## 4 70 0 0.05
## 5 90 0 0.05
## 6 110 0 0.05
ggplot(pwrVn_eff, aes(x=n, y=power)) +
geom_line(aes(colour=as.factor(eff))) +
scale_color_discrete(name="Effect Size") +
labs(x="Sample Size", y="Power") +
theme_bw()
We now turn to a different problem: repeated measurements of the same individuals and the pseudoreplication they can cause. Load the data and the lme4 package:
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
repeateddata <- read.csv(file = "RepeatedMeasurements.csv")
inddata <- read.csv(file = "IndividualMeasurements.csv")
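If you do not have these csv files, data with a similar structure can be simulated. The sketch below is only an assumption based on the model outputs further down (40 individuals split between two groups coded 1 and 2, 10 measurements each, large between-individual variation and small residual variation); it is not the code that generated the actual files.
# hypothetical sketch of a data set with the same structure (not the original data)
set.seed(42)
nind <- 40   # number of individuals
nrep <- 10   # repeated measurements per individual
group <- rep(c(1, 2), each = nind / 2)        # group membership, coded 1 and 2
indeffect <- rnorm(nind, mean = 0, sd = 0.9)  # between-individual variation (assumed sd)
simdata <- data.frame(individual = rep(1:nind, each = nrep),
                      group = rep(group, each = nrep))
simdata$value <- 0.3 * simdata$group + indeffect[simdata$individual] +
  rnorm(nind * nrep, mean = 0, sd = 0.1)      # residual variation (assumed sd)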
Fit lm() on individual data
summary(lm(value ~ 1 + group, data = inddata))
##
## Call:
## lm(formula = value ~ 1 + group, data = inddata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06043 -0.68440 0.09654 0.64552 1.78629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1322 0.4565 -0.290 0.774
## group 0.2845 0.2887 0.985 0.331
##
## Residual standard error: 0.9131 on 38 degrees of freedom
## Multiple R-squared: 0.02491, Adjusted R-squared: -0.0007515
## F-statistic: 0.9707 on 1 and 38 DF, p-value: 0.3307
Fit lm() on repeated data
summary(lm(value ~ 1 + group, data = repeateddata))
##
## Call:
## lm(formula = value ~ 1 + group, data = repeateddata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.26198 -0.60932 -0.03992 0.67202 1.84037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1693 0.1395 -1.214 0.225384
## group 0.3114 0.0882 3.531 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.882 on 398 degrees of freedom
## Multiple R-squared: 0.03037, Adjusted R-squared: 0.02793
## F-statistic: 12.47 on 1 and 398 DF, p-value: 0.0004631
The standard error is too small! The 400 repeated measurements are not independent (each individual was measured 10 times), so treating them as 400 independent observations is pseudoreplication and makes the test overconfident.
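One simple way to see this, before turning to the mixed model, is to average the repeated measurements of each individual and refit the lm on those means (a sketch, not part of the original exercise); the standard error then comes out close to the one from the individual-level analysis:
# one averaged value per individual, then fit the same model on the means
indmeans <- aggregate(value ~ individual + group, data = repeateddata, FUN = mean)
summary(lm(value ~ 1 + group, data = indmeans))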
Fit lmer() on repeated data
summary(lmer(value ~ 1 + group + (1 | individual), data=repeateddata))
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ 1 + group + (1 | individual)
## Data: repeateddata
##
## REML criterion at convergence: -435.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.89405 -0.62776 -0.01201 0.61482 2.92925
##
## Random effects:
## Groups Name Variance Std.Dev.
## individual (Intercept) 0.80424 0.8968
## Residual 0.01008 0.1004
## Number of obs: 400, groups: individual, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.1693 0.4487 -0.377
## group 0.3114 0.2838 1.097
##
## Correlation of Fixed Effects:
## (Intr)
## group -0.949
We find the same point estimate for the group effect as with the naive lm on the repeated data, but the standard error is now corrected (it is close to the one from the individual-level analysis): we do not find a significant effect.
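Note that lme4 does not report p-values for fixed effects. If you want a formal test, one option (an addition to the original code, not part of it) is to look at a confidence interval for the group effect, for instance with confint(); the lmerTest package is another common choice.
# profile confidence intervals for the mixed model; the interval for
# "group" includes 0, consistent with a non-significant effect
fitmm <- lmer(value ~ 1 + group + (1 | individual), data = repeateddata)
confint(fitmm)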