[R] drop1() seems to give unexpected results compare to anova()

Thomas P C Chu tchu11 at netscape.net
Fri Aug 1 11:00:29 CEST 2008


Dear all,

I have been trying to investigate the behaviour of different weights in 
weighted regression for a dataset with lots of missing data. As a start 
I simulated some data using the following:

library(MASS)
N <- 200
sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
sim.set <- as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) <- c('x1', 'x2') # x1 & x2 are correlated
sim.set$x3 <- rnorm(N, 0, 1) # x3 is independent
sim.set$x4 <- rnorm(N, 0, 1) # x4 is red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 # y is 
outcome

I then checked the correlation of my simulated data and fitted a linear 
regression to check if y = 4 * x1 + 5 * x2 + 6 * x3 indeed.

round(cor(sim.set), 2)
summary(model <- lm(y ~ ., data = sim.set))

Coefficients:
 Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.423e-17 1.256e-16 -4.320e-01 0.666
x1 4.000e+00 1.388e-16 2.881e+16 <2e-16 ***
x2 5.000e+00 1.441e-16 3.470e+16 <2e-16 ***
x3 6.000e+00 1.188e-16 5.051e+16 <2e-16 ***
x4 -8.150e-18 1.165e-16 -7.000e-02 0.944

anova(model)

 Df Sum Sq Mean Sq F value Pr(>F)
x1 1 8686.1 8686.1 2.8218e+33 <2e-16 ***
x2 1 5568.7 5568.7 1.8091e+33 <2e-16 ***
x3 1 7852.1 7852.1 2.5509e+33 <2e-16 ***
x4 1 1.507e-32 1.507e-32 4.9000e-03 0.9443
Residuals 195 6.002e-28 3.078e-30

All was well so far, as x4 was identified as not significant and its 
coeff was almost 0 (because I made it so in the first place). Now I 
expected it to be dropped in stepwise:

step(model, direction = 'both', test = 'F')
drop1(model, test = 'F')
dropterm(model, test = 'F')

 Df Sum of Sq RSS AIC F value Pr(F)
<none> 6.002e-28 -13585.7
x1 1 2555.1 2555.1 517.5 8.3006e+32 < 2.2e-16 ***
x2 1 3707.0 3707.0 591.9 1.2043e+33 < 2.2e-16 ***
x3 1 7851.9 7851.9 742.0 2.5508e+33 < 2.2e-16 ***
x4 1 2.118e-27 2.718e-27 -13285.6 6.8806e+02 < 2.2e-16 ***

Neither of those 3 lines of commands managed to drop x4 and its P value 
magically decreased from 0.94 to almost 0! I am also baffled by how R 
calculated those RSS. However, if I fitted the smaller model and 
compared it with the original one by hand, I got the expected answer:

summary(model2 <- lm(y ~ x1 + x2 + x3, data = sim.set))
anova(model, model2, test = 'F')

Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2 + x3
 Res.Df RSS Df Sum of Sq F Pr(>F)
1 195 6.0025e-28
2 196 6.0026e-28 -1 -2.0000e-32 0.0049 0.9443

Interestingly, if I had started from a null model I ended up with y = 4 
* x1 + 5 * x2 + 6 * x3 and R did not add x4 into the model as expected.

summary(model1 <- lm(y ~ 1, data = sim.set))
step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test 
= 'F')
add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')

 Df Sum of Sq RSS AIC F value Pr(F)
<none> 22107.0 943.1
x1 1 8686.1 13420.8 845.2 128.1478 <2e-16 ***
x2 1 11658.7 10448.3 795.2 220.9377 <2e-16 ***
x3 1 11045.4 11061.6 806.6 197.7096 <2e-16 ***
x4 1 13.4 22093.6 944.9 0.1199 0.7295

I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux, 
with all components up to date. Thank you in advance for all your 
thoughts and replies.

Yours sincerely,
Thomas P C Chu

University of Leicester
LE1 7RH
UK

________________________________________________________________________
AOL Email goes Mobile! You can now read your AOL Emails whilst on the 
move. Sign up for a free AOL Email account with unlimited storage today.



**************************************
See what's new at http://www.aol.com



More information about the R-help mailing list