This article is part of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction.
In most cases, the hard work of using multiple imputation comes in the imputation process. Once the imputations are created and checked, Stata makes estimation using the imputed data relatively easy.
mi estimate
The main command for running estimations on imputed data is mi estimate. It is a prefix command, like svy or by, meaning that it goes in front of whatever estimation command you're running.
The mi estimate command first runs the estimation command on each imputation separately. It then combines the results using Rubin's rules and displays the output. Because the output is created by mi estimate, options that affect output, such as or to display odds ratios, must be applied to mi estimate rather than the estimation command. Thus:
mi estimate, or: logit y x
not:
mi estimate: logit y x, or
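For reference, the combination step can be sketched for a single scalar parameter. The notation is ours rather than Stata's: \hat{Q}_m is the estimate from imputation m, U_m is its squared standard error, and M is the number of imputations.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M}U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m-\bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1+\frac{1}{M}\right)B
```

The combined estimate is \bar{Q} and its standard error is \sqrt{T}. The between-imputation variance B is the term that lets the standard error reflect uncertainty about the missing values.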
mi estimate has a list of estimation commands for which it knows Rubin's rules are appropriate. If a command is not on that list, you can tell mi estimate to apply them anyway with the cmdok ("command ok") option. However, it is your responsibility to ensure that the results will be valid.
Subsamples Based on Imputed Variables
Consider a regression like:
mi estimate: reg wage edu exp if race==1
If race is an imputed variable, then some observations will likely have a one for race in some imputations and not others. Thus the subsample to be used will vary between imputations, and mi estimate will give you an error message.
You have two options at this point. One is to simply tell mi estimate to ignore the problem with the esampvaryok option. The Stata documentation says this "may result in biased or inefficient estimates," but we don't have any guidance at this time as to the seriousness of the problem.
mi estimate, esampvaryok: reg wage edu exp if race==1
The other is to not use observations that have imputed values of the variables used to select the subsample. Hopefully you created indicator variables telling you which observations are missing which variables in the process of determining whether your data are MCAR, MAR or MNAR with:
misstable sum, gen(miss_)
If so, you can use those variables as part of your subsample selection:
mi estimate: reg wage edu exp if race==1 & !miss_race
Of course this raises the same issues as complete cases analysis, though the effects will likely be smaller.
Dropping Variables
More rarely, you could run into problems with different imputations using different sets of variables. In our experience that's been the result of perfect prediction in some imputations and not others, which suggests problems with the model being run (such as too many categorical covariates for the number of observations available). But it can also arise from the estimation command choosing different base categories. In that case specifying the base category should fix the problem.
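As a minimal sketch of specifying the base category, the ib#. factor-variable operator can be used inside the estimation command. The model is illustrative only, and the sketch assumes race is coded 0, 1, 2, as the example output later in this article suggests.

```stata
* Illustrative sketch: ib0. fixes level 0 of race as the base category,
* so the same set of indicators enters the model in every imputation.
* (Assumes race is coded 0, 1, 2; adjust the level to match your data.)
mi estimate: reg wage edu exp ib0.race
```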
Postestimation
Postestimation with imputed data must be done with caution. Rubin's rules require certain assumptions to be valid, notably asymptotic normality, and if a quantity does not meet those assumptions then Rubin's rules cannot provide a valid estimate of it. Fortunately, regression coefficients do meet those assumptions. Some quantities, such as R-squared values, can be estimated if they are first transformed to make them approximately normal. Others, such as likelihood ratio test statistics, simply cannot. See White, Royston, and Wood for a list of quantities that can and cannot be combined using Rubin's rules.
Unlike standard estimation commands, mi estimate cannot save all the information needed for postestimation tasks in the e() vector. Some tasks require the e() vector from the regression run on each completed data set. If you're planning to do postestimation, tell mi estimate to store the needed information in a small file with the saving() option:
mi estimate, saving(myestimates, replace): ...
This will create the file myestimates.ster in the current directory.
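One use of the saved file, sketched here with an illustrative model, is to recombine the stored per-imputation results later without refitting the model. This relies on the using form of mi estimate.

```stata
* Fit the model once and save the per-imputation results
mi estimate, saving(myestimates, replace): reg wage edu exp

* Later: recombine the saved results without refitting the model
mi estimate using myestimates
```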
Tests of Coefficients
Hypothesis tests on coefficients can be performed using the mi test command. For testing whether coefficients are equal to zero, the syntax is the same as the regular test command. However, testing transformations or combinations of coefficients is more complicated; type help mi test for more information.
Likelihood ratio tests cannot be performed with multiply imputed data. However, if your goal is to test whether adding covariates improves your basic model, you can test the hypothesis that the coefficients on all those additional covariates are jointly zero.
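The two kinds of test can be sketched as follows. The variables y, x1, x2 and x3 are hypothetical placeholders, and diff is just a label chosen for the expression. For a combination of coefficients, the expression is named in mi estimate and then tested with mi testtransform.

```stata
* Joint test that the coefficients on two additional covariates are zero
* (y, x1, x2 and x3 are hypothetical variable names)
mi estimate: reg y x1 x2 x3
mi test x2 x3

* Testing a combination of coefficients: name the expression in
* mi estimate, then test it with mi testtransform
mi estimate (diff: _b[x1] - _b[x2]): reg y x1 x2 x3
mi testtransform diff
```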
Prediction
Predicted values can be treated as parameters to be estimated. Linear predictions meet the assumptions of Rubin's rules and thus they can be computed for each imputation and then combined as usual. This is done using the mi predict command, but mi predict needs the additional information contained in the estimates file saved by mi estimate. Thus the full command is:
mi predict myprediction using myestimates
Predicted probabilities do not meet the assumptions of Rubin's rules. However, you can estimate predicted probabilities by first estimating the linear prediction using mi predict and then putting the result through an inverse-logit transformation:
mi predict linear_prediction using myestimates, xb
mi xeq: gen predicted_probability=invlogit(linear_prediction)
The xb option tells mi predict to calculate the linear prediction even if the most recent regression involved probabilities.
Monte Carlo Error and the Number of Imputations
Since multiple imputation includes a random component, repeating the same analysis will give slightly different results each time (unless you set the seed of the random number generator). This is obviously an undesirable property, but acceptable as long as the amount of variation is small enough to be unimportant. The variation due to the random component is called the Monte Carlo error.
mi estimate with the mcerror option will report an estimate of the Monte Carlo error in estimation results. The process for calculating it involves leaving out one imputation at a time. White, Royston, and Wood suggest the following guidelines for what constitutes an acceptable amount of Monte Carlo error:
- The Monte Carlo error of a coefficient should be less than or equal to 10% of its standard error
- The Monte Carlo error of a coefficient's T-statistic should be less than or equal to 0.1
- The Monte Carlo error of a coefficient's P-value should be less than or equal to 0.01 if the true P-value is 0.05, or 0.02 if the true P-value is 0.1
If those conditions are not met, you should increase the number of imputations.
Example
Consider the example data set we imputed in an earlier section. It contains (fictional) wage data, which we will model using the covariates exp, edu, female, urban and race. Given what we found in the prior section, we will also interact female with exp and edu.
Data set to be analyzed (includes imputations)
Do file that carries out the analysis
Complete cases analysis (obtained with mi xeq 0: since the data set contains imputations) gives the following results:
mi xeq 0: reg wage female##(c.exp i.edu) urban i.race
m=0 data:
-> reg wage female##(c.exp i.edu) urban i.race

      Source |       SS       df       MS              Number of obs =    1779
-------------+------------------------------           F( 12,  1766) =   98.93
       Model |  1.0350e+12    12  8.6247e+10           Prob > F      =  0.0000
    Residual |  1.5396e+12  1766   871809267           R-squared     =  0.4020
-------------+------------------------------           Adj R-squared =  0.3979
       Total |  2.5746e+12  1778  1.4480e+09           Root MSE      =   29526

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -7033.218   4180.626    -1.68   0.093    -15232.71    1166.277
         exp |   2004.479    102.498    19.56   0.000     1803.449    2205.509
             |
         edu |
          2  |   10679.82   2634.031     4.05   0.000     5513.673    15845.96
          3  |   28279.73   2844.885     9.94   0.000     22700.03    33859.42
          4  |   51097.61   4591.212    11.13   0.000     42092.83    60102.39
             |
female#c.exp |
          1  |   -511.406   150.6243    -3.40   0.001    -806.8267   -215.9853
             |
  female#edu |
        1 2  |    -5736.4   4041.507    -1.42   0.156    -13663.04     2190.24
        1 3  |  -3876.886   4208.948    -0.92   0.357    -12131.93    4378.159
        1 4  |  -12072.54   5845.627    -2.07   0.039    -23537.62   -607.4622
             |
       urban |   4076.262   1577.229     2.58   0.010     982.8305    7169.694
             |
        race |
          1  |  -4409.319    1739.41    -2.53   0.011    -7820.838   -997.8001
          2  |  -4952.449   1790.243    -2.77   0.006    -8463.667   -1441.232
             |
       _cons |   31591.61   3200.808     9.87   0.000     25313.84    37869.38
------------------------------------------------------------------------------
Compare with results using the imputations:
mi estimate, saving(miexan,replace): reg wage female##(c.exp i.edu) urban i.race
Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =       3000
                                                  Average RVI     =     0.3261
                                                  Largest FMI     =     0.3672
                                                  Complete DF     =       2987
DF adjustment:   Small sample                     DF:     min     =      35.46
                                                          avg     =     206.66
                                                          max     =     710.00
Model F test:       Equal FMI                     F(  12,  516.3) =     122.88
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -7637.933   3554.531    -2.15   0.034    -14676.67   -599.1996
         exp |   2003.071   82.22401    24.36   0.000     1841.639    2164.502
             |
         edu |
          2  |   12388.95   2190.569     5.66   0.000     8063.544    16714.36
          3  |   28619.32   2443.457    11.71   0.000     23764.93    33473.71
          4  |   51773.01   4248.138    12.19   0.000     43152.79    60393.23
             |
female#c.exp |
          1  |  -459.7754    130.917    -3.51   0.001    -719.6377   -199.9131
             |
  female#edu |
        1 2  |   -5981.89   3390.213    -1.76   0.080     -12676.1    712.3196
        1 3  |   -4640.03   3554.687    -1.31   0.194    -11672.93    2392.866
        1 4  |  -12926.75   5274.621    -2.45   0.018    -23517.89   -2335.615
             |
       urban |   4467.026   1465.326     3.05   0.004     1508.758    7425.294
             |
        race |
          1  |  -3221.866   1394.161    -2.31   0.021    -5960.173    -483.559
          2  |  -5977.193   1579.916    -3.78   0.000    -9123.292   -2831.093
             |
       _cons |   30617.76   2545.795    12.03   0.000     25614.32     35621.2
------------------------------------------------------------------------------
The 95% confidence intervals are narrower, which is just enough to put the P-value of female under the .05 cutoff for "significance."
These results were calculated with just five imputations, which we suggested as a starting point. How much Monte Carlo error does this leave?
mi estimate, mcerr: reg wage female##(c.exp i.edu) urban i.race
Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =       3000
                                                  Average RVI     =     0.3261
                                                  Largest FMI     =     0.3672
                                                  Complete DF     =       2987
DF adjustment:   Small sample                     DF:     min     =      35.46
                                                          avg     =     206.66
                                                          max     =     710.00
Model F test:       Equal FMI                     F(  12,  516.3) =     122.88
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -7637.933   3554.531    -2.15   0.034    -14676.67   -599.1996
             |   614.5666   219.6112     0.22   0.018     810.9298    823.4979
             |
         exp |   2003.071   82.22401    24.36   0.000     1841.639    2164.502
             |    8.54593   1.876621     0.61   0.000     10.89688    7.715809
             |
             |
         edu |
          2  |   12388.95   2190.569     5.66   0.000     8063.544    16714.36
             |   347.7053   136.7152     0.24   0.000     182.1151    637.3576
             |
          3  |   28619.32   2443.457    11.71   0.000     23764.93    33473.71
             |   453.6681   191.0128     0.92   0.000     628.9118    692.7885
             |
          4  |   51773.01   4248.138    12.19   0.000     43152.79    60393.23
             |   1000.615   225.8875     0.44   0.000     494.3933    1592.194
             |
             |
female#c.exp |
          1  |  -459.7754    130.917    -3.51   0.001    -719.6377   -199.9131
             |   23.88987   6.507623     0.26   0.001     27.60468    30.27696
             |
             |
  female#edu |
        1 2  |   -5981.89   3390.213    -1.76   0.080     -12676.1    712.3196
             |   538.2219   166.8896     0.15   0.024     785.7183    522.4397
             |
        1 3  |   -4640.03   3554.687    -1.31   0.194    -11672.93    2392.866
             |   600.5342   309.3857     0.30   0.090     278.1348    1288.453
             |
        1 4  |  -12926.75   5274.621    -2.45   0.018    -23517.89   -2335.615
             |   1134.893   383.1226     0.21   0.012     1831.385    1154.305
             |
             |
       urban |   4467.026   1465.326     3.05   0.004     1508.758    7425.294
             |   331.6953   169.3962     0.58   0.006     715.5021    340.4445
             |
             |
        race |
          1  |  -3221.866   1394.161    -2.31   0.021    -5960.173    -483.559
             |   155.3758   26.42719     0.10   0.006     183.4623    145.6764
             |
          2  |  -5977.193   1579.916    -3.78   0.000    -9123.292   -2831.093
             |   305.4687   167.2478     0.61   0.001     251.5403    676.1527
             |
             |
       _cons |   30617.76   2545.795    12.03   0.000     25614.32     35621.2
             |   307.1376   84.98822     0.48   0.000     423.9462     281.053
------------------------------------------------------------------------------
Note: values displayed beneath estimates are Monte Carlo error estimates.
A brief glance at the estimates for female shows this does not meet the suggested criteria: the Monte Carlo error on the coefficient is about 17% of the standard error rather than 10%, and the Monte Carlo error on the P-value is .018 when we'd want it to be less than .01 if we believe the true P-value is .05 or less. This suggests you should use more imputations.
Even if it turned out that the Monte Carlo error was acceptable, we'd still recommend using more imputations in this case. About 40% of the observations are missing values, so White, Royston, and Wood would suggest 40 imputations. While that may seem like a lot, the entire process from imputation to analysis (including many diagnostics) still ran in less than 15 minutes in our testing. Thus there's no reason not to use at least that many imputations when you're ready to produce final results.
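Here are the results with 40 imputations. For female, the Monte Carlo error now meets the guidelines: the error on the coefficient is about 7% of its standard error, the error on the t-statistic is 0.08, and the error on the P-value is .004.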
Multiple-imputation estimates                     Imputations     =         40
Linear regression                                 Number of obs   =       3000
                                                  Average RVI     =     0.2340
                                                  Largest FMI     =     0.2505
                                                  Complete DF     =       2987
DF adjustment:   Small sample                     DF:     min     =     494.37
                                                          avg     =     723.63
                                                          max     =    1046.30
Model F test:       Equal FMI                     F(  12, 2407.3) =     134.54
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -8444.446   3568.764    -2.37   0.018    -15450.36   -1438.528
             |   242.2941   70.94422     0.08   0.004      294.817    266.1893
             |
         exp |   1981.635   90.55428    21.88   0.000      1803.76    2159.509
             |   6.812604   2.379668     0.61   0.000     9.595854    6.819707
             |
             |
         edu |
          2  |   11896.82   2277.964     5.22   0.000     7423.364    16370.28
             |   164.9826    51.1388     0.14   0.000     200.7386    187.6542
             |
          3  |   28702.41   2405.316    11.93   0.000      23980.9    33423.92
             |   160.1252   56.74204     0.31   0.000     218.7132    170.9788
             |
          4  |   52000.64   3991.622    13.03   0.000        44158    59843.27
             |   310.4559   94.26391     0.35   0.000     426.0258    288.6724
             |
             |
female#c.exp |
          1  |  -435.3699   126.7969    -3.43   0.001     -684.175   -186.5648
             |   7.621008   1.908626     0.08   0.000     8.635785    8.400978
             |
             |
  female#edu |
        1 2  |  -5278.077   3463.037    -1.52   0.128    -12076.34    1520.188
             |   234.4022   91.09243     0.07   0.017     330.3471    259.2769
             |
        1 3  |  -4498.973   3525.953    -1.28   0.202    -11418.57    2420.619
             |   220.3948   79.45707     0.07   0.025     265.3468    277.5948
             |
        1 4  |   -11832.2   4963.477    -2.38   0.017    -21576.09   -2088.301
             |   336.8887   98.06887     0.08   0.004     383.8081    395.9982
             |
             |
       urban |   4301.578     1351.3     3.18   0.002     1648.864    6954.293
             |   91.36621   21.07064     0.09   0.000     104.6887    96.41936
             |
             |
        race |
          1  |  -3478.967   1540.658    -2.26   0.024    -6505.774   -452.1594
             |   118.6348   35.97534     0.11   0.006     120.0791    155.5802
             |
          2  |  -5657.502   1575.349    -3.59   0.000    -8751.359   -2563.646
             |   115.2371   35.82667     0.12   0.000     120.9807    149.2301
             |
             |
       _cons |   31156.05   2678.619    11.63   0.000     25898.25    36413.85
             |   176.8282   54.19707     0.23   0.000     177.3412    233.6436
------------------------------------------------------------------------------
Note: values displayed beneath estimates are Monte Carlo error estimates.
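The first guideline can be checked by hand for female by dividing the Monte Carlo error of the coefficient by its standard error. The sketch below simply copies the numbers from the two mcerror tables above; it is arithmetic on the displayed output, not a built-in diagnostic.

```stata
* Ratio of Monte Carlo error to standard error for 1.female.
* With 5 imputations the ratio is about .17, above the .10 guideline.
display 614.5666/3554.531

* With 40 imputations the ratio is about .07, within the guideline.
display 242.2941/3568.764
```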
One reasonable question is whether the interactions are actually required, and with unimputed data one might use a likelihood ratio test to answer it. With imputed data you'll instead test whether the coefficients on the interaction terms are jointly equal to zero:
mi test 1.female#c.exp 1.female#2.edu 1.female#3.edu 1.female#4.edu
note: assuming equal fractions of missing information

 ( 1)  1.female#c.exp = 0
 ( 2)  1.female#2.edu = 0
 ( 3)  1.female#3.edu = 0
 ( 4)  1.female#4.edu = 0

       F(  4, 156.0) =    4.63
            Prob > F =    0.0015
Given that some of the terms are significantly different from zero on their own, it's no surprise that the joint test rejects the hypothesis that they are all zero.
Finally, if we wanted to calculate predicted wages, we would use the following:
mi predict wagehat using miexan
Note the use of the miexan.ster file (without needing to specify the extension) created by the initial mi estimate command. It contains the coefficients from the regressions on each completed data set, which are needed to form the individual predictions combined by mi predict.
Last Revised: 10/12/2012