[R] NEWBIE: Help explaining use of lm()?

Wed Nov 22 19:02:44 CET 2006

James W. MacDonald wrote:
> Hi Kevin,
> 
> The hint here would be the fact that you are coding your 'Groups' 
> variable as a factor, which implies (at least in this situation) that 
> the Groups are unordered levels.
> 
> Your understanding of a linear model fit is correct when the independent 
> varible is continuous (like the amount of hormone administered). 
> However, in this case, with unordered factor levels, a linear model is 
> the same as fitting an analysis of variance (ANOVA) model.
> 
> Note that the statistic for an ANOVA (the F-statistic) is a 
> generalization of the t-statistic to more than two groups. In essence 
> the question you are asking with the lm() fit is 'Is the mean age at 
> walking different from a baseline mean for any of these groups?' as 
> compared to the t-test which asks 'Is the mean age at walking different 
> between Group x and Group y?'.
> 
> The way you have set up the analysis uses the 'active' group as a 
> baseline and compares all others to that. In the output you supply, the 
> intercept is the mean age that the active group started walking, and it 
> appears that there is moderate evidence that the 'ctr.8w' group started 
> walking later (the estimate being a little over 2 months later).
> 
> If you were to do individual t-tests making the same comparisons you 
> would end up with the same conclusion, but slightly different p-values, 
> etc, because the degrees of freedom would be different.

One concept that this explanation seems to skip lightly over (and that
took me awhile to absorb) is what it means to substitute a factor in
place of a continuous independent variable in lm().  I don't claim
to be an expert in the field, though, so comments on my elaboration
are welcome.

The use of continuous independent variables in lm() leads to a linear
combination of these variables (m_1*x_1 + m_2*x_2 + ... + m_n*x_n) added
to an intercept (b).  If you replace one continuous variable with a factor,
(x_n -> f_1) you divide the independent variable space ([x_1,x_2,...,x_n-1])
into separate "compartments" each with their own "intercept".  That
intercept is represented as a combination of an intercept
specific to the applicable value of the factor on top of an
overall intercept:

   m_1*x_1+m_2*x_2+...+m_n-1*x_n-1 + b_f_1 + b

so that the term that used to represent the influence of a continuous
variable (m_n*x_n) is replaced with one of a list of constants
(b_a, b_b, b_c if factor f_1 has three levels a, b, and c).

In Kevin's example, no continuous variables are included at all, so
the solution coefficients simply consist of a list of
"intercepts", of which the first is present by default because the
model doesnt specify a "-1" term, and the remaining ones are the list
of intercepts b_f_1.

It is important to understand clearly that a single scaling factor
m_n applied to a continuous variable has been replaced by a specific
value defined for each level of the factor.  The loss of continuity
means that the influence of the variable no longer shows up in a
simple scaling factor (a single number) and must be called out
individually.

The compartmentalization suggests that separate lm() fits could
be applied for each level of the factor, combining the two "intercept"
terms for each fit but the use of a single fit permits the uncertainty
in separate components of the intercept to be identified with the
individual factor levels as well as overall, which (I believe)
explains Jim's assertion that your example amounts to an ANOVA.

I, too, HTH :)

> HTH,
> 
> Jim
> 
> 
> 
> Zembower, Kevin wrote:
> 
>>I'm attempting the heruclean task of teaching myself Introductory
>>Statistics and R at the same time. I'm working through Peter Dalgaard's
>>Introductory Statistics with R, but don't understand why the answer to
>>one of the exercises works. I'm hoping someone will have the patience to
>>explain the answer to me, both in the statistics and R areas.
>>
>>Exercise 6.1 says:
>>The zelazo data are in the form of a list of vectors, one for each of
>>the four groups. Convert the data to a form suitable for the use of lm,
>>and calculate the relevant tests. ...
>>
>>This stumped me right from the beginning. I thought I understood that
>>linear models tried to correlate an independent variable (such as the
>>amount of a hormone administered) to a dependent variable (such as the
>>height of a cornstalk). Its output was a model that could state, "for
>>every 10% increase in the hormone, the height increased by X%."
>>
>>The zelazo data are the ages at walking (in months) of four groups of
>>infants, two controls and two experimentals subjected to different
>>exercise regimens. I don't understand why lm() can be used at all in
>>this circumstance. My initial attempt was to use t.test(), which the
>>answer key does also. I would have never thought to use lm() except for
>>the requirement in the problem. I've pasted in the output of the
>>exercise below, for those without the dataset. Would someone explain why
>>lm() is appropriate to use in this situation, and what the results mean
>>'in plain English?'
>>
>>Thanks for your patience with a newbie. I'm comfortable asking this
>>question to this group because of the patience and understanding this
>>group has almost always shown in explaining and teaching statistics.
>>
>>Kevin Zembower
>>Internet Services Group manager
>>Center for Communication Programs
>>Bloomberg School of Public Health
>>Johns Hopkins University
>>111 Market Place, Suite 310
>>Baltimore, Maryland  21202
>>410-659-6139 
>>============================================================
>>
>>
>>>library("ISwR")
>>
>>Loading required package: survival
>>Loading required package: splines
>>
>>Attaching package: 'ISwR'
>>
>>
>>        The following object(s) are masked from package:survival :
>>
>>         lung 
>>
>>
>>
>>>data(zelazo)
>>>head(zelazo)
>>
>>$active
>>[1]  9.00  9.50  9.75 10.00 13.00  9.50
>>
>>$passive
>>[1] 11.00 10.00 10.00 11.75 10.50 15.00
>>
>>$none
>>[1] 11.50 12.00  9.00 11.50 13.25 13.00
>>
>>$ctr.8w
>>[1] 13.25 11.50 12.00 13.50 11.50
>>
>>
>>
>>>walk <- unlist(zelazo)
>>>walk
>>
>> active1  active2  active3  active4  active5  active6 passive1 passive2 
>>    9.00     9.50     9.75    10.00    13.00     9.50    11.00    10.00 
>>passive3 passive4 passive5 passive6    none1    none2    none3    none4 
>>   10.00    11.75    10.50    15.00    11.50    12.00     9.00    11.50 
>>   none5    none6  ctr.8w1  ctr.8w2  ctr.8w3  ctr.8w4  ctr.8w5 
>>   13.25    13.00    13.25    11.50    12.00    13.50    11.50 
>>
>>
>>>group <- factor(rep(1:4,c(6,6,6,5)), labels=names(zelazo))
>>>group
>>
>> [1] active  active  active  active  active  active  passive passive
>>passive
>>[10] passive passive passive none    none    none    none    none
>>none   
>>[19] ctr.8w  ctr.8w  ctr.8w  ctr.8w  ctr.8w 
>>Levels: active passive none ctr.8w
>>
>>
>>>summary(lm(walk ~ group))
>>
>>
>>Call:
>>lm(formula = walk ~ group)
>>
>>Residuals:
>>    Min      1Q  Median      3Q     Max 
>>-2.7083 -0.8500 -0.3500  0.6375  3.6250 
>>
>>Coefficients:
>>             Estimate Std. Error t value Pr(>|t|)    
>>(Intercept)   10.1250     0.6191  16.355 1.19e-12 ***
>>grouppassive   1.2500     0.8755   1.428   0.1696    
>>groupnone      1.5833     0.8755   1.809   0.0864 .  
>>groupctr.8w    2.2250     0.9182   2.423   0.0255 *  
>>---
>>Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>>
>>Residual standard error: 1.516 on 19 degrees of freedom
>>Multiple R-Squared: 0.2528,     Adjusted R-squared: 0.1348 
>>F-statistic: 2.142 on 3 and 19 DF,  p-value: 0.1285 

-- 
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k