[R] lm: order of dropped columns
Anirban Mukherjee
am253 at cornell.edu
Thu Jul 22 03:19:57 CEST 2010
Hi all,
Thanks Dennis, do greatly appreciate the detailed reply. That does
answer a significant part of what I was getting at.
Unfortunately, I did not ask my question very well. Apologize. Suppose I run
> dat <- data.frame(a = factor(rep(letters[1:3], each = 4)),
+ b = factor(rep(rep(1:2, each = 2), 3)),
+ c = rnorm(12), d=rnorm(12))
> dat$e<-0.999*dat$d
I am (intentionally) making e "almost" equal to d to trigger the rank
condition. Subsequently, I run
> summary(lm(c~a+b+d+e, data=dat))
Call:
lm(formula = c ~ a + b + d + e, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.0726 -0.5578 0.1132 0.4296 1.3397
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2564 0.5288 0.485 0.643
ab 1.0253 0.7234 1.417 0.199
ac -0.5434 0.6869 -0.791 0.455
b2 -0.5521 0.5513 -1.002 0.350
d -0.4098 0.2851 -1.438 0.194
e NA NA NA NA
Residual standard error: 0.91 on 7 degrees of freedom
Multiple R-squared: 0.5177, Adjusted R-squared: 0.2421
F-statistic: 1.878 on 4 and 7 DF, p-value: 0.2191
R appropriately drops e from the design matrix to make X'X
nonsingular. I am wondering how R decides (what is the convention)
whether to drop d, or to drop e. Are terms added from left to right of
the formula, and hence d retained and e dropped?
More generally, if presented with a non-singular design matrix with
column names (x1, x2, x3, x4, x5), what algorithm is used by R to
figure out which columns to drop. Eg: suppose the design matrix with
five columns (x1, x2, x3, x4, x5) has rank 3, and x3 = a1x1 + a2x2,
and x5 = a4x4. Which 2 columns would R try to drop?
I am mostly asking these questions because I am in the process of
working on a package. I want to make sure that my treatment of
singular design matrices is, as far as possible, compatible with lm.
Thanks again!
Best,
Anirban
On Wed, Jul 21, 2010 at 1:41 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> (1) lm() drops columns in a rank-deficient model matrix X to make X'X
> nonsingular - this is called a full-rank reparameterization of the linear
> model.
>
> (2) How many columns of X are dropped depends on its rank, which in turn
> depends on the number of constraints in the model matrix. This is related to
> the degrees of freedom associated with each term in the corresponding linear
> model.. The default contrasts are
>> options()$contrasts
> unordered ordered
> "contr.treatment" "contr.poly"
>
> Other choices include contr.helmert() and contr.sum(). See the help page
> ?contrasts for further details. See also section 6.2 of Venables and
> Ripley's _Modern Applied Statistics with S, 4th ed._ for further information
> on the connection between the columns of the model matrix in ANOVA and the
> defined sets of contrasts.
>
> Under the default contrasts, the first column is dropped for the main effect
> terms. Here's a simple example of a balanced 2-way ANOVA with interaction:
>
> d <- data.frame(a = factor(rep(letters[1:3], each = 4)),
> b = factor(rep(rep(1:2, each = 2), 3)),
> c = rnorm(12))
>> d
> a b c
> 1 a 1 -0.77367688
> 2 a 1 -0.79069791
> 3 a 2 0.69257133
> 4 a 2 2.46788204
> 5 b 1 0.38892289
> 6 b 1 -0.03521033
> 7 b 2 -0.01071611
> 8 b 2 -0.74209425
> 9 c 1 1.36974281
> 10 c 1 -1.22775441
> 11 c 2 0.29621976
> 12 c 2 0.28208192
> m <- aov(c ~ a * b, data = d)
> model.matrix(m)
> (Intercept) ab ac b2 ab:b2 ac:b2
> 1 1 0 0 0 0 0
> 2 1 0 0 0 0 0
> 3 1 0 0 1 0 0
> 4 1 0 0 1 0 0
> 5 1 1 0 0 0 0
> 6 1 1 0 0 0 0
> 7 1 1 0 1 1 0
> 8 1 1 0 1 1 0
> 9 1 0 1 0 0 0
> 10 1 0 1 0 0 0
> 11 1 0 1 1 0 1
> 12 1 0 1 1 0 1
> attr(,"assign")
> [1] 0 1 1 2 3 3
> attr(,"contrasts")
> attr(,"contrasts")$a
> [1] "contr.treatment"
>
> attr(,"contrasts")$b
> [1] "contr.treatment"
>
> Notice that the first column of each main effect is dropped, and that the
> interaction columns are the products of the retained a columns with the
> retained b columns. The attr() components verify that the contrast method
> used for this matrix is the default contr.treatment.
>
>> anova(m)
> Analysis of Variance Table
>
> Response: c
> Df Sum Sq Mean Sq F value Pr(>F)
> a 2 0.5001 0.25003 0.2827 0.7633
> b 1 1.3700 1.36999 1.5489 0.2597
> a:b 2 4.5647 2.28235 2.5804 0.1554
> Residuals 6 5.3070 0.88450
>
> Examination of the degrees of freedom tells us that there are two
> independent contrasts for a, one independent contrast for b and two
> independent contrasts for the interaction of a and b, which are shown in
> model.matrix(m) above.
>
> If you want to reset the baselline factor level, see ?relevel. Also look
> into the contrast package on CRAN.
>
> HTH,
> Dennis
>
> On Wed, Jul 21, 2010 at 8:40 AM, Anirban Mukherjee <am253 at cornell.edu>
> wrote:
>>
>> Hi all,
>>
>> If presented with a singular design matrix, lm drops columns to make the
>> design matrix non-singular. What algorithm is used to select which (and
>> how
>> many) column(s) to drop? Particularly, given a factor, how does lm choose
>> levels of the factor to discard?
>>
>> Thanks for the help.
>>
>> Best,
>> Anirban
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
More information about the R-help
mailing list