[R] Multiple imputation, especially in rms/Hmisc packages
Frank Harrell
f.harrell at vanderbilt.edu
Tue Aug 10 03:37:39 CEST 2010
On Mon, 9 Aug 2010, Mark Seeto wrote:
> Hello, I have a general question about combining imputations as well as a
> question specific to the rms and Hmisc packages.
>
> The situation is multiple regression on a data set where multiple
> imputation has been used to give M imputed data sets. I know how to get
> the combined estimate of the covariance matrix of the estimated
> coefficients (average the M covariance matrices from the individual
> imputations and add (M+1)/M times the between-imputation covariance
> matrix), and I know how to use this to get p-values and confidence
> intervals for the individual coefficients.
>
> What I don't know is how to combine imputations to obtain the multiple
> degree-of-freedom tests to test whether a set of coefficients are all 0.
> If there were no missing values, this would be done using the "general
> linear test" which involves fitting the full model and the reduced
> model and getting an F statistic based on the residual sums of squares and
> degrees of freedom. How does one correctly do this when there are multiple
> imputations? In the language of Hmisc, I'm asking how the values in
> anova(fmi) are calculated if fmi is the result of fit.mult.impute (I've
> looked at the anova.rms code but couldn't understand it).
In ordinary regression, the "leave some variables out and compare
R^2"-based F-test and the quicker contrast tests are equivalent. In
general we use the latter, which is expedient since it is easy to
estimate the imputation-corrected covariance matrix. In the rms and
Hmisc packages, fit.mult.impute creates the needed fit object, then
anova, contrast, summary, etc. give the correct Wald tests.
>
> The second question I have relates to rms/Hmisc specifically. When I use
> fit.mult.impute, the fitted values don't appear to match the values given
> by the predict function on the original data. Instead, they match the
> fitted values of the last imputation. For example,
>
> library(rms)
> set.seed(1)
> x1 <- rnorm(40)
> x2 <- 0.5*x1 + rnorm(40,0,3)
>
> y <- x1^2 + x2 + rnorm(40,0,3)
> x2[40] <- NA # create 1 missing value in x2
> a <- aregImpute(~ y + x1 + x2, n.impute=2, nk=3) # 2 imputations
>
> x2.imp1 <- x2; x2.imp1[40] <- a$imputed$x2[,1] # first imputed x2 vector
> x2.imp2 <- x2; x2.imp2[40] <- a$imputed$x2[,2] # second imputed x2 vector
>
> ols.imp1 <- ols(y ~ rcs(x1,3) + x2.imp1) # model on imputation 1
> ols.imp2 <- ols(y ~ rcs(x1,3) + x2.imp2) # model on imputation 2
>
> d <- data.frame(y, x1, x2)
> fmi <- fit.mult.impute(y ~ rcs(x1,3) + x2, ols, a, data=d)
>
> fmi$fitted
> predict(fmi, d)
>
> I would have expected the results of fmi$fitted and predict(fmi,d) to be
> the same except for the final NA, but they are not. Instead, fmi$fitted is
> the same as ols.imp2$fitted. Also,
>
> anova(ols.imp2)
> anova(fmi)
>
> unexpectedly give the same result in the ERROR line. I would be most
> grateful if anyone could explain this to me.
I need to add some more warnings. For many of the calculations, the
last imputation is used.
Frank
>
> Thanks,
> Mark
> --
> Mark Seeto
> Statistician
>
> National Acoustic Laboratories <http://www.nal.gov.au/>
> A Division of Australian Hearing
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list