Unreliability example for paired-samples \(t\)-test

Author

Phil Chalmers

This basic example demonstrates the influence that unreliable measurements have on the empirical power to reject a null hypothesis of no effect in paired-samples \(t\)-test applications. Four reliability ratios (rxx) are investigated alongside four sample sizes (50, 100, 200, 400). Three non-zero effect sizes, expressed as Cohen's \(d\), are included to evaluate the empirical power, along with a null condition (\(d = 0\)) to verify the Type I error rate.

Simulation code

library(SimDesign)

Design <- createDesign(rxx=c(1, .8, .6, .4),
                       N = c(50, 100, 200, 400),
                       dT = c(0, .2, .5, .8)) # difference in true-scores

# (Optional) obtain the RStudio IDE flags to avoid false-positive diagnostics in the source code
Attach(Design, RStudio_flags = TRUE)
# !diagnostics suppress=rxx,N,dT

Scratch algebra notes for obtaining \(VAR(E)\) given \(rxx\) and \(VAR(T)\).

\[
\begin{aligned}
rxx &= VAR(T) / VAR(X) = VAR(T) / (VAR(T) + VAR(E)) \\
rxx \cdot (VAR(T) + VAR(E)) &= VAR(T) \\
rxx \cdot VAR(T) + rxx \cdot VAR(E) &= VAR(T) \\
rxx \cdot VAR(E) &= VAR(T) - rxx \cdot VAR(T) \\
VAR(E) &= VAR(T)/rxx - VAR(T) \\
VAR(E) &= VAR(T) \cdot (1/rxx - 1)
\end{aligned}
\]
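As a quick numeric sanity check of this identity (my addition, not part of the original simulation), one can generate true scores, add error with variance \(VAR(T) \cdot (1/rxx - 1)\), and confirm that the empirical reliability ratio recovers the target rxx:

set.seed(42)
rxx <- .6
true  <- rnorm(1e6)                        # true scores, VAR(T) = 1
error <- rnorm(1e6, sd = sqrt(1/rxx - 1))  # VAR(E) = VAR(T) * (1/rxx - 1)
X <- true + error                          # observed scores
var(true) / var(X)                         # should land close to .6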

Generate <- function(condition, fixed_objects = NULL) {
    Attach(condition)
    group <- gl(2, N, labels = c('pre', 'post'))
    true.scores <- c(rnorm(N, mean=dT), rnorm(N))
    # E(T) = E(mu_i)
    mu.t <- (0 + dT)/2
    # VAR(T) = E(s_i^2 + mu_i^2) - mu^2 (law of total variance)
    var.t <- .5 * (1^2 + 0^2) + .5 * (1^2 + dT^2) - mu.t^2
    var.e <- var.t * (1/rxx - 1)
    X <- true.scores + rnorm(N*2, sd=sqrt(var.e))
    dat <- data.frame(pre=X[group == 'pre'], 
                      post=X[group == 'post'])
    dat
}
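As a hedged spot check of Generate() (again my addition), drawing a single very large dataset should reproduce the intended true-score difference in the observed scores, while the observed variance is inflated by the measurement error:

cond <- createDesign(rxx = .6, N = 1e5, dT = .5)
dat <- Generate(cond)
mean(dat$pre - dat$post)   # approximately .5 (= dT)
var(dat$pre)               # approximately 1 + VAR(E), inflated by error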

Analyse <- function(condition, dat, fixed_objects = NULL) {
    out <- with(dat, t.test(pre, post, paired=TRUE))
    # return p-value, mean difference, and SE
    nc(p=out$p.value, mean_diff=out$estimate, SE=out$stderr)
}

Summarise <- function(condition, results, fixed_objects = NULL) {
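    # EDR(): empirical detection rate, the proportion of p-values below alpha (.05 by default)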
    power <- EDR(results[,"p"])
    means <- colMeans(subset(results, select=mean_diff:SE))
    c(power=power, means)
}
res <- runSimulation(design=Design, replications=5000, generate=Generate,
                     analyse=Analyse, summarise=Summarise, parallel=TRUE)
res
# A tibble: 64 × 10
     rxx     N    dT  power   mean_diff       SE REPLICATIONS SIM_TIME      SEED
   <dbl> <dbl> <dbl>  <dbl>       <dbl>    <dbl>        <dbl> <chr>        <int>
 1   1      50     0 0.055   0.0036951  0.19958          5000 0.20s     1.0933e9
 2   0.8    50     0 0.0522 -0.0019380  0.22233          5000 0.17s     1.1974e9
 3   0.6    50     0 0.0536  0.00030942 0.25673          5000 0.17s     1.9321e9
 4   0.4    50     0 0.052   0.0045053  0.31562          5000 0.17s     8.5535e8
 5   1     100     0 0.0556  0.0027358  0.14132          5000 0.40s     8.2160e8
 6   0.8   100     0 0.052  -0.0013261  0.15766          5000 0.19s     1.4461e9
 7   0.6   100     0 0.0476  0.00030481 0.18201          5000 0.17s     1.7148e9
 8   0.4   100     0 0.052  -0.0017916  0.22297          5000 0.17s     8.0936e8
 9   1     200     0 0.0484 -0.0019795  0.099886         5000 0.19s     1.5565e9
10   0.8   200     0 0.0536  0.0013104  0.11157          5000 0.18s     4.6978e8
# ℹ 54 more rows
# ℹ 1 more variable: COMPLETED <chr>

Results

Below is summary information in the form of tables and plots to demonstrate the negative effect of unreliability.

library(dplyr)

res |> select(rxx:SE, -power) |>
    mutate(bias=dT-mean_diff) |>
    arrange(N, dT, rxx) |>
    select(-mean_diff) |> knitr::kable()
 rxx    N   dT      SE    bias
 0.4   50   0.0   0.316  -0.005
 0.6   50   0.0   0.257   0.000
 0.8   50   0.0   0.222   0.002
 1.0   50   0.0   0.200  -0.004
 0.4   50   0.2   0.315   0.005
 0.6   50   0.2   0.257  -0.004
 0.8   50   0.2   0.222  -0.002
 1.0   50   0.2   0.199   0.001
 0.4   50   0.5   0.321   0.003
 0.6   50   0.5   0.260  -0.001
 0.8   50   0.5   0.223   0.002
 1.0   50   0.5   0.199   0.001
 0.4   50   0.8   0.329  -0.006
 0.6   50   0.8   0.265  -0.002
 0.8   50   0.8   0.226  -0.001
 1.0   50   0.8   0.199   0.001
 0.4  100   0.0   0.223   0.002
 0.6  100   0.0   0.182   0.000
 0.8  100   0.0   0.158   0.001
 1.0  100   0.0   0.141  -0.003
 0.4  100   0.2   0.224  -0.002
 0.6  100   0.2   0.182   0.000
 0.8  100   0.2   0.158  -0.001
 1.0  100   0.2   0.141  -0.001
 0.4  100   0.5   0.227   0.005
 0.6  100   0.5   0.184   0.001
 0.8  100   0.5   0.159  -0.002
 1.0  100   0.5   0.141  -0.001
 0.4  100   0.8   0.233  -0.001
 0.6  100   0.8   0.188  -0.002
 0.8  100   0.8   0.160  -0.001
 1.0  100   0.8   0.141   0.003
 0.4  200   0.0   0.158   0.001
 0.6  200   0.0   0.129  -0.001
 0.8  200   0.0   0.112  -0.001
 1.0  200   0.0   0.100   0.002
 0.4  200   0.2   0.158   0.000
 0.6  200   0.2   0.129   0.002
 0.8  200   0.2   0.112   0.002
 1.0  200   0.2   0.100   0.002
 0.4  200   0.5   0.161   0.000
 0.6  200   0.5   0.131  -0.002
 0.8  200   0.5   0.112  -0.001
 1.0  200   0.5   0.100   0.000
 0.4  200   0.8   0.165   0.002
 0.6  200   0.8   0.133   0.000
 0.8  200   0.8   0.113   0.002
 1.0  200   0.8   0.100  -0.002
 0.4  400   0.0   0.112   0.000
 0.6  400   0.0   0.091   0.001
 0.8  400   0.0   0.079   0.000
 1.0  400   0.0   0.071   0.001
 0.4  400   0.2   0.112   0.001
 0.6  400   0.2   0.091   0.001
 0.8  400   0.2   0.079  -0.001
 1.0  400   0.2   0.071  -0.001
 0.4  400   0.5   0.114   0.001
 0.6  400   0.5   0.092   0.001
 0.8  400   0.5   0.079  -0.001
 1.0  400   0.5   0.071  -0.001
 0.4  400   0.8   0.117   0.000
 0.6  400   0.8   0.094   0.002
 0.8  400   0.8   0.080   0.002
 1.0  400   0.8   0.071  -0.001

Estimates of the mean difference reflect the true difference between the pre- and post-test observations regardless of the reliability (i.e., they are unbiased); however, the associated SE is larger for less reliable instruments. This is because the within-subject variability is now a function of the individual-differences (true-score) variance plus the measurement-error variance. This has direct implications for the power to reject the null hypothesis of no difference between the pre- and post-tests, as demonstrated below.
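To see why the SE behaves this way, here is a small analytic sketch (my addition, reusing the definitions from Generate()): because the pre and post scores are generated independently, each with true-score variance 1 plus error variance \(VAR(E)\), the standard error of the paired difference is \(\sqrt{2(1 + VAR(E))/N}\). These values line up with the simulated SEs in the table above:

SE_analytic <- function(rxx, N, dT) {
    # same true-score mixture variance and error variance as in Generate()
    mu.t <- (0 + dT)/2
    var.t <- .5 * (1 + 0) + .5 * (1 + dT^2) - mu.t^2
    var.e <- var.t * (1/rxx - 1)
    sqrt(2 * (1 + var.e) / N)
}
SE_analytic(rxx = 1,  N = 50, dT = 0)   # 0.200, cf. the rxx = 1.0, N = 50 row
SE_analytic(rxx = .4, N = 50, dT = 0)   # 0.316, cf. the rxx = 0.4, N = 50 row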

For the paired samples \(t\)-test:

library(ggplot2)
ggplot(res, aes(dT, power, colour=factor(rxx))) +
    geom_point() + geom_line() +
    geom_abline(slope = 0, intercept = .05, linetype = 'dashed') +
    xlab('Effect size (d)') + ylab('Detection Rate') +
    scale_color_discrete('Reliability') + facet_wrap(~N) +
    ggtitle('Empirical Power Curves: Paired-samples t-test')

Solving sample size given reliability

Alternatively, one can estimate the required sample size to achieve a specific power of interest (e.g., \(1-\beta=.80\)) for more direct comparisons. This can be achieved using the SimSolve() function in SimDesign, or via the related package Spower, which uses SimSolve() to solve stochastic root problems in the context of power analysis. Below is an example of the former.

DesignNA <- createDesign(rxx=c(1, .8, .6, .4),
                         N = NA,
                         dT = c(.2, .5, .8)) # difference in true-scores
DesignNA
# A tibble: 12 × 3
     rxx     N    dT
   <dbl> <dbl> <dbl>
 1   1      NA   0.2
 2   0.8    NA   0.2
 3   0.6    NA   0.2
 4   0.4    NA   0.2
 5   1      NA   0.5
 6   0.8    NA   0.5
 7   0.6    NA   0.5
 8   0.4    NA   0.5
 9   1      NA   0.8
10   0.8    NA   0.8
11   0.6    NA   0.8
12   0.4    NA   0.8
solved <- SimSolve(design=DesignNA, b=.80, interval=c(3, 1000),
                   generate=Generate, analyse=Analyse, summarise=Summarise,
                   verbose=FALSE)
solved
# A tibble: 12 × 3
     rxx     N    dT
   <dbl> <dbl> <dbl>
 1   1     394   0.2
 2   0.8   490   0.2
 3   0.6   655   0.2
 4   0.4   990   0.2
 5   1      62   0.5
 6   0.8    80   0.5
 7   0.6   106   0.5
 8   0.4   164   0.5
 9   1      25   0.8
10   0.8    34   0.8
11   0.6    44   0.8
12   0.4    68   0.8

The take-home from this output is straightforward: within each effect-size combination, in order to achieve an 80% power rate given the reliability of the test, one must use noticeably more observations as the test becomes less reliable.
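As a rough cross-check on these stochastic roots (my addition), the \(rxx = 1\) rows can be compared against the closed-form paired \(t\)-test solution from stats::power.t.test(), since the paired differences then have standard deviation \(\sqrt{2}\); for \(rxx < 1\) one would substitute \(\sqrt{2(1 + VAR(E))}\) instead.

# closed-form n for the perfectly reliable condition; expect a value
# near the solved N = 394 for dT = .2 (up to Monte Carlo error)
power.t.test(delta = .2, sd = sqrt(2), power = .80, type = "paired")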

This type of information is important when performing a priori power planning, as the required sample size to achieve a given power rate can and will change as the instruments used become less reliable. As Levin and Subkoviak (1977) put it, to determine sample sizes “without simultaneously considering errors of measurement is to live in a ‘fool’s paradise’” (p. 337).