Introduction to the SimDesign package
Phil Chalmers
2024-11-14
Source:vignettes/SimDesign-intro.Rmd
SimDesign-intro.Rmd
Seek computer programs that allow you to do the thinking. – George E. P. Box
Whether you are interested in evaluating the performance of a new
optimizer or estimation criteria, re-evaluating previous research claims
(e.g., ANOVA is ‘robust’ to violations of normality), determine power
rates for an upcoming research proposal, or simply to appease a strange
thought in your head about a new statistical idea you heard about,
designing Monte Carlo simulations can be incredibly rewarding and are
extremely important to those who are statistically oriented. However,
organizing simulations can be a challenge, particularly to those new to
the topic, where all too often coders resort to the inefficient and
error prone strategies (e.g., the dreaded “for-loop” strategy,
for-ever resulting in confusing, error prone, and simulation
specific code). The package SimDesign
is one attempt to fix
these and other issues that often arise when designing Monte Carlo
simulation experiments, while also providing a templated setup that is
designed to support many useful features that can be useful when
evaluating simulation research.
Generally speaking, Monte Carlo simulations can be broken into three major components:
- generate your data from some model/probability density function given various design conditions to be studied (e.g., sample size, distributions, group sizes, etc),
- analyse the generated data using whatever statistical analyses you are interested in (e.g., -test, ANOVA, SEMs, IRT, etc), and collect the statistics/CIs/-values/parameter estimates you are interested in, and
- summarise the results after repeating the simulations number of times to obtain empirical estimates of the population’s behavior.
Each operation above represents the essential components of the
SimDesign
package. The design component is
represented by a tibble
-like object containing the
simulation conditions to be investigated, while
generate, analyse, and
summarise represent user-defined functions which
comprise the three steps in the simulation. Each of these components are
constructed and passed to the runSimulation()
function
where the simulation steps are evaluated, ultimately returning a
tibble
-like object containing the simulation results.
A general overview
After loading the SimDesign
package, we begin by
defining the required user-constructed functions. To expedite this
process, a call to SimFunctions()
will create a template to
be filled in, where all the necessary functional arguments have been
pre-assigned, and only the body of the functions need to be modified.
The documentation of each argument can be found in the respective R help
files, however there organization is very simple conceptually.
To begin, the following code should be copied and saved to an external source (i.e., text) file.
#-------------------------------------------------------------------
library(SimDesign)
Design <- createDesign(factor1 = NA,
factor2 = NA)
#-------------------------------------------------------------------
Generate <- function(condition, fixed_objects) {
dat <- data.frame()
dat
}
Analyse <- function(condition, dat, fixed_objects) {
ret <- nc(stat1 = NaN, stat2 = NaN)
ret
}
Summarise <- function(condition, results, fixed_objects) {
ret <- c(bias = NaN, RMSE = NaN)
ret
}
#-------------------------------------------------------------------
res <- runSimulation(design=Design, replications=2, generate=Generate,
analyse=Analyse, summarise=Summarise)
res
Alternatively, if you are lazy (read: efficient) or just don’t like
copy-and-pasting, SimFunctions()
can write the output to a
file by providing a filename
argument. The following
creates a file (mysim.R
) containing the simulation
design/execution and required user-defined functions.
SimDesign::SimFunctions('mysim')
For larger simulations, you may want to use two files, and if you’d
prefer to have helpful comments included then these can be achieved with
the singlefile
and comments
arguments,
respectively.
SimFunctions('mysim', singlefile = FALSE, comments = TRUE)
The choice of using a single file or not is entirely a matter of
preference, and will not influence the overall simulation
implementation. However, should you wish to include separate analysis or
generate functions the arguments nGenerate
and
nAnalyse
can be useful to compartmentalize generally
distinct portions of the code (e.g., one analyse function for fitting
and extracting components of a structural equation model, and one
analyse function for fitting and extracting information form an item
response theory model).
Simulation: Determine estimator efficiency
As a toy example, let’s consider how the following question can be
investigated with SimDesign
:
Question: How does trimming affect recovering the mean of a distribution? Investigate this using different sample sizes with Gaussian and distributions. Also, demonstrate the effect of using the median to recover the mean.
Define the conditions
First, define the condition combinations that should be investigated.
In this case we wish to study 4 different sample sizes, and use a
symmetric and skewed distribution. The use of
createDesign()
is extremely helpful here to create a
completely crossed-design for each combination (there are 8 in
total).
Design <- createDesign(sample_size = c(30, 60, 120, 240),
distribution = c('norm', 'chi'))
Design
## # A tibble: 8 × 2
## sample_size distribution
## <dbl> <chr>
## 1 30 norm
## 2 60 norm
## 3 120 norm
## 4 240 norm
## 5 30 chi
## 6 60 chi
## 7 120 chi
## 8 240 chi
Each row in Design
represents a unique condition to be
studied in the simulation. In this case, the first condition to be
studied comes from row 1, where
and the distribution should be normal.
Define the functions
We first start by defining the data generation functional component.
The only argument accepted by this function is condition
,
which will always be a single row from the Design data.frame
object of class data.frame
. Conditions are run
sequentially from row 1 to the last row in Design
. It is
also possible to pass a fixed_objects
object to the
function for including fixed sets of population parameters and other
conditions, however for this simple simulation this input is not
required.
Generate <- function(condition, fixed_objects) {
N <- condition$sample_size
dist <- condition$distribution
if(dist == 'norm'){
dat <- rnorm(N, mean = 3)
} else if(dist == 'chi'){
dat <- rchisq(N, df = 3)
}
dat
}
As we can see above, Generate()
will return a numeric
vector of length
containing the data to be analysed each with a population mean of 3
(because a
distribution has a mean equal to its df). Next, we define the
analyse
component to analyse said data:
Analyse <- function(condition, dat, fixed_objects) {
M0 <- mean(dat)
M1 <- mean(dat, trim = .1)
M2 <- mean(dat, trim = .2)
med <- median(dat)
ret <- c(mean_no_trim=M0, mean_trim.1=M1, mean_trim.2=M2, median=med)
ret
}
This function accepts the data previously returned from
Generate()
(dat
), the condition vector
previously mentioned.
At this point, we may conceptually think of the first two functions
as being evaluated independently
times to obtain
sets of results. In other words, if we wanted the number of replications
to be 100, the first two functions would be independently run (at least)
100 times, the results from Analyse()
would be stored, and
we would then need to summarise these 100 elements into meaningful meta
statistics to describe their empirical properties. This is where
computing meta-statistics such as bias, root mean-square error,
detection rates, and so on are of primary importance. Unsurprisingly,
then, this is the purpose of the summarise
component:
Summarise <- function(condition, results, fixed_objects) {
obs_bias <- bias(results, parameter = 3)
obs_RMSE <- RMSE(results, parameter = 3)
ret <- c(bias=obs_bias, RMSE=obs_RMSE, RE=RE(obs_RMSE))
ret
}
Again, condition
is the same as was defined before,
while results
is a matrix
containing all the
results from Analyse()
, where each row represents the
result returned from each respective replication, and the number of
columns is equal to the length of a single vector returned by
Analyse()
.
That sounds much more complicated than it is — all you really need to
know for this simulation is that an
x 4 matrix called results
is available to build a suitable
summary from. Because the results is a matrix, apply()
is
useful to apply a function over each respective row. The bias and RMSE
are obtained for each respective statistic, and the overall result is
returned as a vector.
Stopping for a moment and thinking carefully, we know that each
condition
will be paired with a unique vector returned from
Summarise()
. Therefore, you might be thinking that the
result returned from the simulation will be in a rectangular form, such
as in a matrix
, data.frame
, or
tibble
. Well, you’d be right!
Putting it all together
The last stage of the SimDesign
work-flow is to pass the
four defined elements to the runSimulation()
function
which, unsurprisingly given it’s name, runs the simulation. There are
numerous options available in the function, and these should be
investigated by reading the help(runSimulation)
HTML file.
Options for performing simulations in parallel, storing/resuming
temporary results, debugging functions, and so on are available. Below
we simply request that each condition be run 1000 times on a single
processor, and finally store the results to an object called
results
.
res <- runSimulation(Design, replications = 1000, generate=Generate,
analyse=Analyse, summarise=Summarise)
##
## Design: 1/8; Replications: 1000; RAM Used: 70.4 Mb; Total Time: 0.00s
## Conditions: sample_size=30, distribution=norm
##
## Design: 2/8; Replications: 1000; RAM Used: 71.4 Mb; Total Time: 0.20s
## Conditions: sample_size=60, distribution=norm
##
## Design: 3/8; Replications: 1000; RAM Used: 71.4 Mb; Total Time: 0.39s
## Conditions: sample_size=120, distribution=norm
##
## Design: 4/8; Replications: 1000; RAM Used: 71.5 Mb; Total Time: 0.58s
## Conditions: sample_size=240, distribution=norm
##
## Design: 5/8; Replications: 1000; RAM Used: 71.5 Mb; Total Time: 0.80s
## Conditions: sample_size=30, distribution=chi
##
## Design: 6/8; Replications: 1000; RAM Used: 71.5 Mb; Total Time: 0.99s
## Conditions: sample_size=60, distribution=chi
##
## Design: 7/8; Replications: 1000; RAM Used: 71.6 Mb; Total Time: 1.18s
## Conditions: sample_size=120, distribution=chi
##
## Design: 8/8; Replications: 1000; RAM Used: 71.6 Mb; Total Time: 1.39s
## Conditions: sample_size=240, distribution=chi
##
# Final simulation object
res
## # A tibble: 8 × 19
## sample_size distribution bias.mean_no_trim bias.mean_trim.1 bias.mean_trim.2
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 30 norm -0.0048080 -0.0046834 -0.0029265
## 2 60 norm 0.0030546 0.0039219 0.0031080
## 3 120 norm 0.0011275 0.00078027 0.00057227
## 4 240 norm 0.00095261 0.0011309 0.00070515
## 5 30 chi 0.022222 -0.29863 -0.44355
## 6 60 chi -0.0042835 -0.34631 -0.49384
## 7 120 chi -0.0050244 -0.34566 -0.48818
## 8 240 chi -0.000058300 -0.34650 -0.48943
## # ℹ 14 more variables: bias.median <dbl>, RMSE.mean_no_trim <dbl>,
## # RMSE.mean_trim.1 <dbl>, RMSE.mean_trim.2 <dbl>, RMSE.median <dbl>,
## # RE.mean_no_trim <dbl>, RE.mean_trim.1 <dbl>, RE.mean_trim.2 <dbl>,
## # RE.median <dbl>, REPLICATIONS <dbl>, SIM_TIME <chr>, RAM_USED <chr>,
## # SEED <int>, COMPLETED <chr>
If runSimulation(..., store_results = TRUE)
were used,
which is the current default in the package if RAM is not an issue, then
the complete stored results can be viewed using
# Extract complete set of stored results
results <- SimResults(res)
results
## # A tibble: 8,000 × 6
## sample_size distribution mean_no_trim mean_trim.1 mean_trim.2 median
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 30 norm 3.18 3.14 3.15 3.17
## 2 30 norm 2.84 2.80 2.78 2.86
## 3 30 norm 3.29 3.21 3.14 2.95
## 4 30 norm 3.34 3.37 3.34 3.28
## 5 30 norm 2.78 2.77 2.77 2.76
## 6 30 norm 3.15 3.08 3.06 3.03
## 7 30 norm 3.03 3.00 3.00 2.99
## 8 30 norm 3.50 3.51 3.55 3.55
## 9 30 norm 2.95 2.91 2.90 2.94
## 10 30 norm 2.98 2.99 3.01 2.95
## # ℹ 7,990 more rows
As can be seen from the printed results from the res
object, each result from the Summarise()
function has been
paired with their respective condition, meta-statistics have been
properly named, and three additional columns have been appended to the
results: REPLICATIONS
, which indicates how many time the
conditions were performed, SIM_TIME
, indicating the time
(in seconds) it took to completely finish the respective conditions, and
SEED
, which indicates the random seeds used by
SimDesign
for each condition (for reproducibility). A call
to View()
in the R console may also be a nice way to sift
through the res
object.
Interpreting the results
In this case, visually inspecting the simulation table is enough to understand what is occurring, though for other Monte Carlo simulations use of ANOVAs, marginalized tables, and graphics should be used to capture the essentially phenomenon in the results. Monte Carlo simulations are just like collecting and analysing data for experiments, so my advice would be to put on your analysis hats and present your data as though it were data collected from the real world.
In this particular simulation, it is readily apparent that using the un-adjusted mean will adequately recover the population mean with little bias. The precision also seems to increase as sample sizes increase, which is indicated by the decreasing RMSE statistics. Generally, trimming causes less efficiency in the estimates, where greater amounts of trimming results in even less efficiency, and using the median as a proxy to estimate the mean is the least effective method. This is witnessed rather clearly in the following table, which prints the relative efficiency of the estimators:
REs <- res[,grepl('RE\\.', colnames(res))]
data.frame(Design, REs)
## sample_size distribution RE.mean_no_trim RE.mean_trim.1 RE.mean_trim.2
## 1 30 norm 1 1.0 1.1
## 2 60 norm 1 1.1 1.1
## 3 120 norm 1 1.1 1.2
## 4 240 norm 1 1.1 1.1
## 5 30 chi 1 1.3 1.9
## 6 60 chi 1 2.0 3.2
## 7 120 chi 1 3.3 5.7
## 8 240 chi 1 5.7 10.5
## RE.median
## 1 1.4
## 2 1.6
## 3 1.5
## 4 1.6
## 5 2.8
## 6 5.1
## 7 9.2
## 8 17.1
Finally, when the distribution was investigated only the un-adjusted mean accurately portrayed the population mean. This isn’t surprising, because the trimmed mean is, after all, making inferences about the population trimmed mean, and the median is making inferences about, well, the median. Only when the distributions under investigation are symmetric are the statistics able to draw inferences about the same inferences about the mean of the population.
Conceptual walk-through of what runSimulation() is doing
The following is a conceptual breakdown of what
runSimulation()
is actually doing behind the scenes. Here
we demonstrate the results from the first condition (row 1 of
Design
) to show what each function returns.
A single replication in a Monte Carlo simulation results in the following objects:
(condition <- Design[1, ])
## # A tibble: 1 × 2
## sample_size distribution
## <dbl> <chr>
## 1 30 norm
dat <- Generate(condition)
dat
## [1] 2.37 3.18 2.16 4.60 3.33 2.18 3.49 3.74 3.58 2.69 4.51 3.39 2.38 0.79 4.12
## [16] 2.96 2.98 3.94 3.82 3.59 3.92 3.78 3.07 1.01 3.62 2.94 2.84 1.53 2.52 3.42
res <- Analyse(condition, dat)
res
## mean_no_trim mean_trim.1 mean_trim.2 median
## 3.1 3.2 3.2 3.3
We can see that Generate()
returns a
numeric
vector which is accepted by Analyse()
.
The Analyse()
function then completes the analysis portion
using the generated data, and returns a named vector with the observed
parameter estimates. Of course, this is only a single replication, and
therefore is not really meaningful in the grand scheme of things; so, it
must be repeated a number of times.
# repeat 1000x
results <- matrix(0, 1000, 4)
colnames(results) <- names(res)
for(i in 1:1000){
dat <- Generate(condition)
res <- Analyse(condition, dat)
results[i, ] <- res
}
head(results)
## mean_no_trim mean_trim.1 mean_trim.2 median
## [1,] 3.1 3.1 3.1 2.9
## [2,] 3.1 3.1 3.1 3.1
## [3,] 3.1 3.1 3.0 2.8
## [4,] 2.7 2.6 2.6 2.7
## [5,] 3.2 3.2 3.2 3.0
## [6,] 3.1 3.1 3.0 3.1
The matrix stored in results
contains 1000 parameter
estimates returned from each statistic. After this is obtained, we can
move on to summarising the output through the Summarise()
function to obtain average estimates, their associated sampling error,
their efficiency, and so on.
Summarise(condition, results)
## bias.mean_no_trim bias.mean_trim.1 bias.mean_trim.2 bias.median
## -0.0011 -0.0031 -0.0035 -0.0037
## RMSE.mean_no_trim RMSE.mean_trim.1 RMSE.mean_trim.2 RMSE.median
## 0.1739 0.1777 0.1859 0.2146
## RE.mean_no_trim RE.mean_trim.1 RE.mean_trim.2 RE.median
## 1.0000 1.0442 1.1425 1.5225
This process is then repeated for each row in the Design
object until the entire simulation study is complete.
Of course, runSimulation()
does much more than this
conceptual outline, which is why it exists. Namely, errors and warnings
are controlled and tracked, data is re-drawn when needed, parallel
processing is supported, debugging is easier with the debug
input (or by inserting browser()
directly), temporary and
full results can be saved to external files, the simulation state can be
saved/restored, build-in safety features are included, and more. The
point, however, is that you as the user should not be bogged down
with the nitty-gritty details of setting up the simulation
work-flow/features; instead, you should be focusing your time on
the important generate-analyse-summarise steps, organized in the body of
the above functions, that are required to obtain your interesting
simulation results. After all, the point designing a computer simulation
experiment is to understand the resulting output, not to become a master
of all aspects of your select computing language pertaining to object
storage, parallel processing, RAM storage, defensive coding, progress
reporting, reproducibility, post-processing, …, ad nauseam.
To access further examples and instructions feel free to visit the package wiki on Github