Simulations Involving Missing Data

Author

Phil Chalmers

For some simulations, the purpose is to determine how well estimators behave in the presence of missing data. Several approaches exist for dealing with missing data, including case-wise removal, pairwise removal, full-information maximum-likelihood estimation, and multiple imputation. This simulation simply demonstrates how to use the add_missing() function in the generate step to create different missing data mechanisms.
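As a rough illustration of what a missing data mechanism does to a generated variable, the base-R sketch below sets each element of a vector to NA according to an element-wise probability. The helper name add_missing_sketch() is hypothetical, and this is not SimDesign's implementation; it only mirrors the idea that a user-supplied function returns a missingness probability for each observation.

```r
# Hypothetical helper (not part of SimDesign): set each element of y to NA
# with the probability returned by prob_fun(y)
add_missing_sketch <- function(y, prob_fun) {
    probs <- prob_fun(y)
    stopifnot(length(probs) == length(y))
    y[runif(length(y)) < probs] <- NA
    y
}

set.seed(1)
y <- rnorm(1000)

# Constant probability: missingness is unrelated to the data (MCAR)
y_mcar <- add_missing_sketch(y, function(y) rep(.2, length(y)))

# Probability depends on the value itself: missing not at random (MNAR)
y_mnar <- add_missing_sketch(y, function(y) ifelse(y > 1, .5, 0))

mean(is.na(y_mcar))        # proportion missing, close to .2
all(y[is.na(y_mnar)] > 1)  # TRUE: only values above 1 were removed
```

The two probability functions correspond to the two mechanisms studied in the simulations that follow.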

Define the functions

The first simulation defined here is a missing completely at random (MCAR) scheme in which 20% of the observations in each variable are missing. Three sample size conditions are studied to compare list-wise and pairwise removal of missing data when computing correlation coefficients.

library(SimDesign)
#SimFunctions(comments = FALSE)

### Define design conditions
Design <- createDesign(N = c(50, 100, 200))

#--------------------------------------------------------------------------

Generate <- function(condition, fixed_objects = NULL) {
    cormat <- matrix(.5, 3, 3)
    diag(cormat) <- 1
    dat <- rmvnorm(condition$N, sigma = cormat) # from SimDesign
    dat <- apply(dat, 2, add_missing, rate = .2)
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    r0 <- cor(dat, use = 'complete.obs')
    r1 <- cor(dat, use = 'pairwise.complete')
    pick <- lower.tri(r0)
    ret <- c(listwise=mean(r0[pick]), pairwise=mean(r1[pick]))
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    obs_bias <- bias(results, parameter = .5)
    obs_RMSE <- RMSE(results, parameter = .5)
    ret <- c(bias=obs_bias, RMSE=obs_RMSE, RE = RE(obs_RMSE))
    ret
}

#--------------------------------------------------------------------------

### Run the simulation
res <- runSimulation(Design, replications=1000, verbose=FALSE, parallel = TRUE,
                     generate=Generate, analyse=Analyse, summarise=Summarise)
res
# A tibble: 3 × 12
      N bias.listwise bias.pairwise RMSE.listwise RMSE.pairwise RE.listwise
  <dbl>         <dbl>         <dbl>         <dbl>         <dbl>       <dbl>
1    50   -0.011128     -0.0092286       0.12446       0.10316            1
2   100   -0.0036090    -0.0029115       0.079858      0.068365           1
3   200   -0.00013245   -0.00056474      0.056916      0.047614           1
# ℹ 6 more variables: RE.pairwise <dbl>, REPLICATIONS <dbl>, SIM_TIME <chr>,
#   SEED <int>, COMPLETED <chr>, WARNINGS <int>

Not surprisingly, these results suggest that computing correlations with the pairwise-complete method is more efficient than removing rows in a list-wise fashion, though both approaches result in approximately unbiased estimates when the missing data mechanism is MCAR. The choice of removal method matters less as the sample size increases, though the pairwise-complete method gives better results in general.
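One way to see where the efficiency difference comes from is to count how much data each method retains. Assuming the 20% missingness is applied independently to each of the three columns (as in the Generate function above), a row is fully observed with probability .8^3, and a given pair of columns is jointly observed with probability .8^2. The base-R sketch below checks this arithmetic against one simulated missingness pattern; the object names are illustrative.

```r
rate <- .2
p <- 3

# Expected proportion of rows retained by each deletion strategy
listwise_expected <- (1 - rate)^p   # all three variables observed: .512
pairwise_expected <- (1 - rate)^2   # both variables in a pair observed: .64

# Empirical check with an independent missingness indicator per cell
set.seed(1)
N <- 10000
observed <- matrix(runif(N * p) >= rate, N, p)
listwise_observed <- mean(rowSums(observed) == p)
pairwise_observed <- mean(observed[, 1] & observed[, 2])

round(c(listwise_expected = listwise_expected,
        listwise_observed = listwise_observed,
        pairwise_expected = pairwise_expected,
        pairwise_observed = pairwise_observed), 3)
```

Under these assumptions, pairwise deletion uses about 25% more observations per correlation (.64 / .512 = 1.25), which is consistent with its smaller RMSE in the table.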

MNAR as the missing data mechanism

This is essentially the same simulation as above; however, the missing data mechanism is chosen so that extremely positive values in the data are more likely to be set to NA. Specifically, values greater than 1 have a 50% chance of being removed, while all other values are always observed. This creates a missing not at random (MNAR) effect (a.k.a. non-ignorable missingness).

Generate <- function(condition, fixed_objects = NULL) {
    fun <- function(y) ifelse(y > 1, .5, 0)
    cormat <- matrix(.5, 3, 3)
    diag(cormat) <- 1
    dat <- rmvnorm(condition$N, sigma = cormat)
    dat <- apply(dat, 2, add_missing, fun=fun)
    dat
}

res <- runSimulation(Design, replications=1000, verbose=FALSE, parallel = TRUE,
                     generate=Generate, analyse=Analyse, summarise=Summarise)
res
# A tibble: 3 × 12
      N bias.listwise bias.pairwise RMSE.listwise RMSE.pairwise RE.listwise
  <dbl>         <dbl>         <dbl>         <dbl>         <dbl>       <dbl>
1    50     -0.082072     -0.066482      0.13121       0.11666            1
2   100     -0.075474     -0.060518      0.10368       0.090278           1
3   200     -0.073830     -0.058527      0.088492      0.074428           1
# ℹ 6 more variables: RE.pairwise <dbl>, REPLICATIONS <dbl>, SIM_TIME <chr>,
#   SEED <int>, COMPLETED <chr>, WARNINGS <int>

In this case we can see the effect of MNAR on the bias of the correlation estimates. This particular missing data mechanism causes the estimates to be too low on average, thereby underestimating the true magnitude of the correlations. Unlike in the MCAR case, the bias does not shrink toward zero as the sample size increases.
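The direction of this bias can be checked outside of the simulation framework with a single large sample, where sampling error is small. The base-R sketch below assumes the same setup as above (standard normal variables correlated at .5, and a 50% chance of removal for values above 1). It builds the correlated data by hand with a Cholesky factor rather than with SimDesign's generators, and all object names are illustrative.

```r
set.seed(1)
N <- 1e5
cormat <- matrix(.5, 3, 3)
diag(cormat) <- 1

# Correlated standard normals via the Cholesky factor of cormat
dat <- matrix(rnorm(N * 3), N, 3) %*% chol(cormat)

# Expected proportion missing per variable: P(y > 1) * .5
expected_rate <- pnorm(1, lower.tail = FALSE) * .5

# MNAR mechanism: values above 1 are set to NA with probability .5
remove <- dat > 1 & matrix(runif(N * 3) < .5, N, 3)
dat_mnar <- dat
dat_mnar[remove] <- NA

# Average the three unique correlations before and after removal
r_full <- cor(dat)[lower.tri(cormat)]
r_mnar <- cor(dat_mnar, use = 'pairwise.complete.obs')[lower.tri(cormat)]

round(c(expected_rate = expected_rate,
        observed_rate = mean(is.na(dat_mnar)),
        r_full = mean(r_full),
        r_mnar = mean(r_mnar)), 3)
```

With a sample this large, the mean correlation from the thinned data should sit clearly below .5, in line with the negative bias reported in the table, even though only about 8% of the values in each variable are removed.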