[R] drop1() seems to give unexpected results compare to anova()
Thomas P C Chu
tchu11 at netscape.net
Fri Aug 1 11:00:29 CEST 2008
Dear all,
I have been trying to investigate the behaviour of different weights in
weighted regression for a dataset with lots of missing data. As a start
I simulated some data using the following:
library(MASS)
N <- 200
sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
sim.set <- as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) <- c('x1', 'x2') # x1 & x2 are correlated
sim.set$x3 <- rnorm(N, 0, 1) # x3 is independent
sim.set$x4 <- rnorm(N, 0, 1) # x4 is red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 # y is
outcome
I then checked the correlation of my simulated data and fitted a linear
regression to check if y = 4 * x1 + 5 * x2 + 6 * x3 indeed.
round(cor(sim.set), 2)
summary(model <- lm(y ~ ., data = sim.set))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.423e-17 1.256e-16 -4.320e-01 0.666
x1 4.000e+00 1.388e-16 2.881e+16 <2e-16 ***
x2 5.000e+00 1.441e-16 3.470e+16 <2e-16 ***
x3 6.000e+00 1.188e-16 5.051e+16 <2e-16 ***
x4 -8.150e-18 1.165e-16 -7.000e-02 0.944
anova(model)
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 8686.1 8686.1 2.8218e+33 <2e-16 ***
x2 1 5568.7 5568.7 1.8091e+33 <2e-16 ***
x3 1 7852.1 7852.1 2.5509e+33 <2e-16 ***
x4 1 1.507e-32 1.507e-32 4.9000e-03 0.9443
Residuals 195 6.002e-28 3.078e-30
All was well so far, as x4 was identified as not significant and its
coeff was almost 0 (because I made it so in the first place). Now I
expected it to be dropped in stepwise:
step(model, direction = 'both', test = 'F')
drop1(model, test = 'F')
dropterm(model, test = 'F')
Df Sum of Sq RSS AIC F value Pr(F)
<none> 6.002e-28 -13585.7
x1 1 2555.1 2555.1 517.5 8.3006e+32 < 2.2e-16 ***
x2 1 3707.0 3707.0 591.9 1.2043e+33 < 2.2e-16 ***
x3 1 7851.9 7851.9 742.0 2.5508e+33 < 2.2e-16 ***
x4 1 2.118e-27 2.718e-27 -13285.6 6.8806e+02 < 2.2e-16 ***
Neither of those 3 lines of commands managed to drop x4 and its P value
magically decreased from 0.94 to almost 0! I am also baffled by how R
calculated those RSS. However, if I fitted the smaller model and
compared it with the original one by hand, I got the expected answer:
summary(model2 <- lm(y ~ x1 + x2 + x3, data = sim.set))
anova(model, model2, test = 'F')
Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2 + x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 195 6.0025e-28
2 196 6.0026e-28 -1 -2.0000e-32 0.0049 0.9443
Interestingly, if I had started from a null model I ended up with y = 4
* x1 + 5 * x2 + 6 * x3 and R did not add x4 into the model as expected.
summary(model1 <- lm(y ~ 1, data = sim.set))
step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test
= 'F')
add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
Df Sum of Sq RSS AIC F value Pr(F)
<none> 22107.0 943.1
x1 1 8686.1 13420.8 845.2 128.1478 <2e-16 ***
x2 1 11658.7 10448.3 795.2 220.9377 <2e-16 ***
x3 1 11045.4 11061.6 806.6 197.7096 <2e-16 ***
x4 1 13.4 22093.6 944.9 0.1199 0.7295
I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux,
with all components up to date. Thank you in advance for all your
thoughts and replies.
Yours sincerely,
Thomas P C Chu
University of Leicester
LE1 7RH
UK
________________________________________________________________________
AOL Email goes Mobile! You can now read your AOL Emails whilst on the
move. Sign up for a free AOL Email account with unlimited storage today.
**************************************
See what's new at http://www.aol.com
More information about the R-help
mailing list