6.1 Overview of modeling choices

The specification of the imputation model is the most challenging step in multiple imputation. The imputation model should

  • account for the process that created the missing data,

  • preserve the relations in the data, and

  • preserve the uncertainty about these relations.

The idea is that adherence to these principles will yield proper imputations (cf. Section 2.3.3), and thus result in valid statistical inferences. What are the choices that we need to make, and in what order? The following seven choices need to be made:

  1. First, we should decide whether the MAR assumption is plausible. See Sections 1.2 and 2.2.4 for an introduction to MAR and MNAR. FCS can handle both MAR and MNAR. Multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations. There are many ways to do this. Section 3.8 described one way to do so within the FCS framework. Section 6.2 deals with this issue in more detail.

  2. The second choice refers to the form of the imputation model. The form encompasses both the structural part and the assumed error distribution. In FCS the form needs to be specified for each incomplete column in the data. The choice will be steered by the scale of the variable to be imputed, and preferably incorporates knowledge about the relation between the variables. Chapter 3 described many different methods for creating univariate imputations.

  3. A third choice concerns the set of variables to include as predictors in the imputation model. The general advice is to include as many relevant variables as possible, including their interactions (Collins, Schafer, and Kam 2001). This may, however, lead to unwieldy model specifications. Section 6.3 describes the facilities within the mice() function for setting the predictor matrix.

  4. The fourth choice is whether we should impute variables that are functions of other (incomplete) variables. Many datasets contain derived variables, sum scores, interaction variables, ratios and so on. It can be useful to incorporate the transformed variables into the multiple imputation algorithm. Section 6.4 describes methods that we can use to incorporate such additional knowledge about the data.

  5. The fifth choice concerns the order in which variables should be imputed. The visit sequence may affect the convergence of the algorithm and the synchronization between derived variables. Section 6.5.1 discusses relevant options.

  6. The sixth choice concerns the setup of the starting imputations and the number of iterations \(M\). The convergence of the MICE algorithm can be monitored in many ways. Section 6.5.2 outlines some techniques that assist in this task.

  7. The seventh choice is \(m\), the number of multiply imputed datasets. Setting \(m\) too low may result in large simulation error and statistical inefficiency, especially if the fraction of missing information is high. Section 2.8 provided guidelines for setting \(m\).
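To give a rough feel for the inefficiency mentioned in the seventh choice, one commonly used large-sample approximation (due to Rubin, and brought in here as an illustration rather than taken from this section) expresses the efficiency of an estimate based on \(m\) imputations, relative to one based on infinitely many, as

\[
\mathrm{RE}(m, \gamma) = \left(1 + \frac{\gamma}{m}\right)^{-1},
\]

where \(\gamma\) denotes the fraction of missing information. With a hypothetical \(\gamma = 0.3\), this gives \(\mathrm{RE} = 1/1.06 \approx 0.943\) for \(m = 5\) and \(\mathrm{RE} = 1/1.015 \approx 0.985\) for \(m = 20\). The approximation concerns the variance of the point estimate only, so it should not be read as a complete guideline for choosing \(m\).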

These choices must always be made, whether explicitly or not. Imputation software therefore needs to supply defaults, which are intended to be useful across a wide range of applications. However, the default choices are not necessarily the best for the data at hand. There is simply no magical setting that always works, so some tailoring is often needed. Section 6.6 highlights some diagnostic tools that aid in determining the choices.
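As a minimal sketch of how several of the seven choices surface as arguments of mice(), the code below first inspects the defaults and then overrides a few of them. It assumes the mice package and its bundled nhanes data (columns age, bmi, hyp, chl). The particular overrides (the "norm" method for bmi, dropping hyp as a predictor of bmi, the seed) are arbitrary illustrations, not recommendations for these data. Choices 1 and 4 (the missing data mechanism and derived variables) need more thought than a single argument and are not shown.

```r
library(mice)

# Dry run with zero iterations: nothing is imputed, but the object
# records the defaults that the software would use.
ini  <- mice(nhanes, maxit = 0, printFlag = FALSE)
meth <- ini$method            # choice 2: one imputation method per column
pred <- ini$predictorMatrix   # choice 3: rows are targets, columns are predictors

# Illustrative tailoring only (assumptions, not advice for these data).
meth["bmi"] <- "norm"         # Bayesian linear regression instead of the default
pred["bmi", "hyp"] <- 0       # do not use hyp when imputing bmi

imp <- mice(nhanes,
            method          = meth,
            predictorMatrix = pred,
            visitSequence   = "monotone",  # choice 5: order of imputation
            maxit           = 10,          # choice 6: number of iterations
            m               = 5,           # choice 7: number of imputed datasets
            seed            = 123,
            printFlag       = FALSE)

# Trace plots of the mean and standard deviation of the imputed values
# per iteration, one simple way to monitor convergence.
plot(imp)
```

Printing meth and pred before the second call is a quick way to see exactly which defaults are being accepted and which are being changed.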