Comparison of For-Loop and SimDesign code

Author

Phil Chalmers

This is an example taken from the tutorial article: “Conducting Simulation Studies in the R Programming Environment” in Tutorials in Quantitative Methods for Psychology (2013, Vol. 9(2), p. 43-60) by Kevin A. Hallgren. While the majority of the article seems to be good at explaining the mechanics of performing Monte Carlo simulations in R, the implementation unfortunately used the for loop strategy, and demonstrates very nicely why for loops make coding simulations very difficult to read, modify, and debug.

The following is the code from Appendix A regarding how Sobel’s test for mediation performs. However, to save space the comments have been removed. As well, the original code itself actually contained typographical errors (a phenomenon Sigal and Chalmers, 2016, argued happens quite often with the for-loop strategy), which I have commented on below. For further details about the simulation, see the original article.

Simulation Using For-Loops

N_list = c(100, 300)
a_list = c(-.3, 0, .3)
b_list = c(-.3, 0, .3)
cp_list = c(-.2, 0, .2) 
reps = 1000

set.seed(192)

sobel_test <- function(X, M, Y){
    M_X = lm(M ~ X)
    Y_XM = lm(Y ~ X + M)
    a = coefficients(M_X)[2]
    b = coefficients(Y_XM)[3]
    stdera = summary(M_X)$coefficients[2,2] 
    stderb = summary(Y_XM)$coefficients[3,2]
    sobelz = a*b / sqrt(b^2 * stdera^2 + a^2 * stderb^2) 
    return(sobelz)
}

# the following line was added to determine the computation time
time0 <- Sys.time()

d = NULL
for (N in N_list){
    for (a in a_list){ #this line was fixed; originally was "for (a in 1:a_list){"
        for (b in b_list){
            for (cp in cp_list){
                for (i in 1:reps){
                    X = rnorm(N, 0, 1)
                    M = a*X + rnorm(N, 0, 1)
                    Y = cp*X + b*M + rnorm(N, 0, 1)
                    d = rbind(d, c(i, a, b, cp, N, 1,
                                   sobel_test(X, M, Y)))
                    d = rbind(d, c(i, a, b, cp, N, 2,
                                   sobel_test(X, Y, M)))
                }
            }
        }
    }
}
colnames(d) = c("iteration", "a", "b", "cp", "N", "model", "Sobel_z")
d = data.frame(d)

# the following line was added to determine the computation time
time1 <- Sys.time()
time1 - time0

Time difference of 4.77 mins

head(d)

  iteration    a    b   cp   N model Sobel_z
1         1 -0.3 -0.3 -0.2 100     1  1.7836
2         1 -0.3 -0.3 -0.2 100     2  0.3398
3         2 -0.3 -0.3 -0.2 100     1  2.1927
4         2 -0.3 -0.3 -0.2 100     2 -0.0064
5         3 -0.3 -0.3 -0.2 100     1  1.6929
6         3 -0.3 -0.3 -0.2 100     2 -0.5918

Simulation Using SimDesign

Compare the previous for-loop approach with SimDesign. Personally, I find it more telling what the \(p\)-values are doing instead of how the \(z\)-values are behaving, so the Sobel test function was modified to return both the \(z\) and \(p\)-values.

# Generate then edit R template file 'Hallgren2013.R'
# SimDesign::SimFunctions('Hallgren2013')

# function returns large-sample p-values and other information
sobel_test <- function(X, M, Y){
    M_X <- lm(M ~ X)
    Y_XM <- lm(Y ~ X + M)
    a <- coefficients(M_X)[2] # extract from numeric vector
    b <- coefficients(Y_XM)[3]
    stdera <- summary(M_X)$coefficients[2,2] # extract from list first then matrix 
    stderb <- summary(Y_XM)$coefficients[3,2]
    sobelz <- a*b / sqrt(b^2 * stdera^2 + a^2 * stderb^2)
    sobelp <- pnorm(abs(sobelz), lower.tail = FALSE)*2
    ret <- data.frame(a=a, SE_a=stdera, b=b, SE_b=stderb, 
                      z=sobelz, p=sobelp)
    ret
}

#-------------------------------------------------------------------
library(SimDesign)

# fully-crossed simulation experiment
Design <- createDesign(N = c(100, 300),
                       a = c(-.3, 0, .3),
                       b = c(-.3, 0, .3),
                       cp = c(-.2, 0, .2))

#-------------------------------------------------------------------
Generate <- function(condition, fixed_objects = NULL) {
    Attach(condition) # make N, a, b, and cp accessable
    X <- rnorm(N)
    M <- a*X + rnorm(N)
    Y <- cp*X + b*M + rnorm(N)
    dat <- data.frame(X=X, M=M, Y=Y) 
    dat    
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    Attach(dat) # make objects X, M, and Y directly accessible
    sobel <- sobel_test(X=X, M=M, Y=Y)$p      
    sobel_incorrect <- sobel_test(X=X, M=Y, Y=M)$p    
    ret <- c(sobel=sobel, sobel_incorrect=sobel_incorrect) 
    ret    # named vector of p-values
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- EDR(results, alpha = .05) # results object is a 'data.frame'
    ret    # empirical detection rate returned
}

#-------------------------------------------------------------------
res <- runSimulation(design=Design, replications=1000, generate=Generate, 
                     analyse=Analyse, summarise=Summarise)



Design: 1/54;   Replications: 1000;   RAM Used: 58.5 Mb;   Total Time: 0.00s 
 Conditions: N=100, a=-0.3, b=-0.3, cp=-0.2


Design: 2/54;   Replications: 1000;   RAM Used: 59.4 Mb;   Total Time: 5.44s 
 Conditions: N=300, a=-0.3, b=-0.3, cp=-0.2


Design: 3/54;   Replications: 1000;   RAM Used: 59.4 Mb;   Total Time: 11.03s 
 Conditions: N=100, a=0, b=-0.3, cp=-0.2


Design: 4/54;   Replications: 1000;   RAM Used: 59.4 Mb;   Total Time: 16.41s 
 Conditions: N=300, a=0, b=-0.3, cp=-0.2


Design: 5/54;   Replications: 1000;   RAM Used: 59.4 Mb;   Total Time: 21.92s 
 Conditions: N=100, a=0.3, b=-0.3, cp=-0.2


Design: 6/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 27.29s 
 Conditions: N=300, a=0.3, b=-0.3, cp=-0.2


Design: 7/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 32.82s 
 Conditions: N=100, a=-0.3, b=0, cp=-0.2


Design: 8/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 38.16s 
 Conditions: N=300, a=-0.3, b=0, cp=-0.2


Design: 9/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 43.69s 
 Conditions: N=100, a=0, b=0, cp=-0.2


Design: 10/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 49.01s 
 Conditions: N=300, a=0, b=0, cp=-0.2


Design: 11/54;   Replications: 1000;   RAM Used: 59.5 Mb;   Total Time: 54.55s 
 Conditions: N=100, a=0.3, b=0, cp=-0.2


Design: 12/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 59.88s 
 Conditions: N=300, a=0.3, b=0, cp=-0.2


Design: 13/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 01m 5.37s 
 Conditions: N=100, a=-0.3, b=0.3, cp=-0.2


Design: 14/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 01m 10.71s 
 Conditions: N=300, a=-0.3, b=0.3, cp=-0.2


Design: 15/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 01m 16.21s 
 Conditions: N=100, a=0, b=0.3, cp=-0.2


Design: 16/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 01m 21.58s 
 Conditions: N=300, a=0, b=0.3, cp=-0.2


Design: 17/54;   Replications: 1000;   RAM Used: 59.6 Mb;   Total Time: 01m 27.07s 
 Conditions: N=100, a=0.3, b=0.3, cp=-0.2


Design: 18/54;   Replications: 1000;   RAM Used: 59.7 Mb;   Total Time: 01m 32.43s 
 Conditions: N=300, a=0.3, b=0.3, cp=-0.2


Design: 19/54;   Replications: 1000;   RAM Used: 59.7 Mb;   Total Time: 01m 37.91s 
 Conditions: N=100, a=-0.3, b=-0.3, cp=0


Design: 20/54;   Replications: 1000;   RAM Used: 59.7 Mb;   Total Time: 01m 43.30s 
 Conditions: N=300, a=-0.3, b=-0.3, cp=0


Design: 21/54;   Replications: 1000;   RAM Used: 59.7 Mb;   Total Time: 01m 48.92s 
 Conditions: N=100, a=0, b=-0.3, cp=0


Design: 22/54;   Replications: 1000;   RAM Used: 59.7 Mb;   Total Time: 01m 54.26s 
 Conditions: N=300, a=0, b=-0.3, cp=0


Design: 23/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 01m 59.76s 
 Conditions: N=100, a=0.3, b=-0.3, cp=0


Design: 24/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 02m 5.10s 
 Conditions: N=300, a=0.3, b=-0.3, cp=0


Design: 25/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 02m 10.61s 
 Conditions: N=100, a=-0.3, b=0, cp=0


Design: 26/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 02m 15.97s 
 Conditions: N=300, a=-0.3, b=0, cp=0


Design: 27/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 02m 21.50s 
 Conditions: N=100, a=0, b=0, cp=0


Design: 28/54;   Replications: 1000;   RAM Used: 59.8 Mb;   Total Time: 02m 26.88s 
 Conditions: N=300, a=0, b=0, cp=0


Design: 29/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 32.48s 
 Conditions: N=100, a=0.3, b=0, cp=0


Design: 30/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 37.76s 
 Conditions: N=300, a=0.3, b=0, cp=0


Design: 31/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 43.31s 
 Conditions: N=100, a=-0.3, b=0.3, cp=0


Design: 32/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 48.59s 
 Conditions: N=300, a=-0.3, b=0.3, cp=0


Design: 33/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 54.18s 
 Conditions: N=100, a=0, b=0.3, cp=0


Design: 34/54;   Replications: 1000;   RAM Used: 59.9 Mb;   Total Time: 02m 59.48s 
 Conditions: N=300, a=0, b=0.3, cp=0


Design: 35/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 5.06s 
 Conditions: N=100, a=0.3, b=0.3, cp=0


Design: 36/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 10.43s 
 Conditions: N=300, a=0.3, b=0.3, cp=0


Design: 37/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 16.33s 
 Conditions: N=100, a=-0.3, b=-0.3, cp=0.2


Design: 38/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 21.66s 
 Conditions: N=300, a=-0.3, b=-0.3, cp=0.2


Design: 39/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 27.21s 
 Conditions: N=100, a=0, b=-0.3, cp=0.2


Design: 40/54;   Replications: 1000;   RAM Used: 60 Mb;   Total Time: 03m 32.52s 
 Conditions: N=300, a=0, b=-0.3, cp=0.2


Design: 41/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 03m 38.10s 
 Conditions: N=100, a=0.3, b=-0.3, cp=0.2


Design: 42/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 03m 43.44s 
 Conditions: N=300, a=0.3, b=-0.3, cp=0.2


Design: 43/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 03m 49.03s 
 Conditions: N=100, a=-0.3, b=0, cp=0.2


Design: 44/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 03m 54.35s 
 Conditions: N=300, a=-0.3, b=0, cp=0.2


Design: 45/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 03m 59.94s 
 Conditions: N=100, a=0, b=0, cp=0.2


Design: 46/54;   Replications: 1000;   RAM Used: 60.1 Mb;   Total Time: 04m 5.20s 
 Conditions: N=300, a=0, b=0, cp=0.2


Design: 47/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 10.77s 
 Conditions: N=100, a=0.3, b=0, cp=0.2


Design: 48/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 16.09s 
 Conditions: N=300, a=0.3, b=0, cp=0.2


Design: 49/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 21.67s 
 Conditions: N=100, a=-0.3, b=0.3, cp=0.2


Design: 50/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 26.95s 
 Conditions: N=300, a=-0.3, b=0.3, cp=0.2


Design: 51/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 32.52s 
 Conditions: N=100, a=0, b=0.3, cp=0.2


Design: 52/54;   Replications: 1000;   RAM Used: 60.2 Mb;   Total Time: 04m 37.83s 
 Conditions: N=300, a=0, b=0.3, cp=0.2


Design: 53/54;   Replications: 1000;   RAM Used: 60.3 Mb;   Total Time: 04m 43.49s 
 Conditions: N=100, a=0.3, b=0.3, cp=0.2


Design: 54/54;   Replications: 1000;   RAM Used: 60.3 Mb;   Total Time: 04m 48.81s 
 Conditions: N=300, a=0.3, b=0.3, cp=0.2


Simulation complete. Total execution time: 04m 54.44s

res

# A tibble: 54 × 11
       N     a     b    cp sobel sobel_incorrect REPLICATIONS SIM_TIME RAM_USED
   <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>        <dbl> <chr>    <chr>   
 1   100  -0.3  -0.3  -0.2 0.487           0.063         1000 5.44s    59.4 Mb 
 2   300  -0.3  -0.3  -0.2 0.995           0.382         1000 5.59s    59.4 Mb 
 3   100   0    -0.3  -0.2 0.012           0.194         1000 5.38s    59.4 Mb 
 4   300   0    -0.3  -0.2 0.026           0.866         1000 5.52s    59.4 Mb 
 5   100   0.3  -0.3  -0.2 0.501           0.451         1000 5.36s    59.5 Mb 
 6   300   0.3  -0.3  -0.2 0.997           0.993         1000 5.53s    59.5 Mb 
 7   100  -0.3   0    -0.2 0.014           0.003         1000 5.34s    59.5 Mb 
 8   300  -0.3   0    -0.2 0.034           0.018         1000 5.52s    59.5 Mb 
 9   100   0     0    -0.2 0               0.002         1000 5.33s    59.5 Mb 
10   300   0     0    -0.2 0               0.013         1000 5.54s    59.5 Mb 
# ℹ 44 more rows
# ℹ 2 more variables: SEED <int>, COMPLETED <chr>

res

# A tibble: 54 × 11
       N     a     b    cp sobel sobel_incorrect REPLICATIONS SIM_TIME RAM_USED
   <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>        <dbl> <chr>    <chr>   
 1   100  -0.3  -0.3  -0.2 0.487           0.063         1000 5.44s    59.4 Mb 
 2   300  -0.3  -0.3  -0.2 0.995           0.382         1000 5.59s    59.4 Mb 
 3   100   0    -0.3  -0.2 0.012           0.194         1000 5.38s    59.4 Mb 
 4   300   0    -0.3  -0.2 0.026           0.866         1000 5.52s    59.4 Mb 
 5   100   0.3  -0.3  -0.2 0.501           0.451         1000 5.36s    59.5 Mb 
 6   300   0.3  -0.3  -0.2 0.997           0.993         1000 5.53s    59.5 Mb 
 7   100  -0.3   0    -0.2 0.014           0.003         1000 5.34s    59.5 Mb 
 8   300  -0.3   0    -0.2 0.034           0.018         1000 5.52s    59.5 Mb 
 9   100   0     0    -0.2 0               0.002         1000 5.33s    59.5 Mb 
10   300   0     0    -0.2 0               0.013         1000 5.54s    59.5 Mb 
# ℹ 44 more rows
# ℹ 2 more variables: SEED <int>, COMPLETED <chr>

To see all the analysis results from Design[1,], for example, inspect the object returned from SimResults().

row1 <- SimResults(res, 1)

Comparison

There are a few interesting properties to note about this comparison example. First, SimDesign is faster than the for-loop strategy, despite the fact it is technically doing more computations in the new Sobel test function and summarise steps. This is without running the SimDesign simulation code in parallel too, which would decrease the simulations times by a factor proportional to the number of cores available (and, unlike the for-loop code, only requires adding a parallel = TRUE flag to runSimulation()). Furthermore, the current state of the for loop code unfortunately cannot be used with parallel computations without a large amount of re-writing, because the object d is updated across each replications (therefore, the objects between each replication condition are not strictly independent).

Finally, the code itself is much less error prone and is much safer in case errors/warnings arise, the generation/analysis components are isolated better, the output is saved to external .rds files in order to inspect the results at a later time (thereby being more memory efficient), the code-base and results are generally more readable, …, the list goes on. All this despite the fact that the working code is nearly identical.