[R] Comparing models in multiple regression and hierarchical linear regression

Wed Nov 8 23:43:04 CET 2006

Hello Jenifer,

your question reflects some very interesting statistical problems,
surrounding the effects of subset selection upon estimation and
inference.  There does not seem to be a right answer, as far as I am
aware, so I will offer an opinion.  I'd welcome further discussion.  I
don't have a copy of Crawley handy, so I can't speak for his
recommendations.

Firstly, I would check: are your regression assumptions satisfied for
all the models under active consideration?  If the regression
assumptions aren't satisfied, then the tests that you might perform -
even those that are used to compare between the models - can be
compromised.

Then, it really depends upon the use to which the model will be put.
I think that the motivations of modellers can be placed on a spectrum,
ranging from purely prediction at one end to purely estimation
(interpretation) at the other. The modelling decisions that you make
ought to depend on your motivations.  I infer from your post that your
goal is to at least partially to understand the model.

Some authoritative statisticians claim that trying to interpret a main
effect when you have significant interactions that include that effect
is dangerous.  This is because the knowledge of the interactions will
always modify your interpretations of the main effects, even if only
slightly.  When you try to interpret main effects in the presence of
interactions, without acknowledging those interactions, you're
implicitly averaging across the interaction parameters.  Whether or
not this is meaningful, and therefore reasonable, depends on the
underlying representativeness of the data, which in turn depends on
the purpose of the model and the sample design (if there is one!).
For example, if I'm interested in predicting tree height as a function
of diameter, I know that the tree species (and age) will interact with
diameter, because different species and ages of trees have different
shapes.  However, I will still happily assert that height increases
with diameter, on average, and might even report and interpret the
main effect (diameter) coefficient.  Even though I know that such a
report is averaging across the diameters and species in my sample, I'm
comfortable that the coefficient holds meaning to the reader.  I think
that it would be pedantic to assert otherwise.

On the other hand, at least one _other_ authoritative statistician I
know claims that he will never bother to test a term that he can't
interpret - he usually draws the line at three-way interactions.  So,
opinions are divided among the authorities.

A part of the problem is reliance on statistical significance as a
metric for model quality.  If your sample size is large, then it's
possible to detect signals in the data that don't have much practical
import.  But, it's clear from the rmse reduction that you report that
there is some predictive heft somewhere in the higher interactions.
But it may well be that only one or two of the curly interaction terms
are really important.  If that be the case, it would be important to
find them.  To do that, you might try fiddling with the stepAIC
arguments to make it less sensitive, and comparing the rmse of the
models you get with the models that you have.  I would also be tempted
to do this from the point of view of adjusting for the many tests that
stepAIC performs.  

So, if you discover that a simpler model explains about as much of the
underlying variation, then I think that life is easier.  If no such
simpler model exists, then interpretation is a problem.  If you really
can't face interpreting the complex model, then I would do the
following.

1) report and interpret the simpler model, with the caveats noted
   above (about significant interaction terms) carefully outlined.

2) report the relevant fit statistics of the best stepAIC model.

3) consider reporting fit statistics of best stepAIC models
   constrained by order of interaction, e.g., best possible model
   constrained to 2-way, 3-way, etc interactions.

(The last step provides the reader with a sense of the complexity of
the hierarchical structure of the model.)

On the question of hierarchical linear regression: it sounds to me
like what you want to do is a whole model test of two models, one
comprising the four variables and the other comprising the five
variables.  If I'm correct, then 

anova(pie.4, pie.5) 

will work.  You will have to think about the ingredients of the pie
models - should they include only main effects, or all interactions,
or all statistically significant interactions (which I think has some
bad implications for the frequentist properties of the test, someone
please correct me if I'm wrong).

You might also consider using an added-variable plot to summarize the
utility of your fifth variable.

Cheers

Andrew

On Mon, Nov 06, 2006 at 09:05:02PM -0600, Jenifer Larson-Hall wrote:
> I don?t know if this question properly belongs on this list, but I?ll ask it here because I?ve been using R to run linear regression models, and it is only in using R (after switching from using SPSS) that I have discovered the process of fitting a linear model. However, after reading Crowley (2002), Fox (2002), Verzani (2004), Dalgaard (2002) and of course searching the R-help archives I cannot find an answer to my question.
> 	I have 5 explanatory variables (NR, NS, PA, KINDERWR, WM) and one response variable (G1L1WR). A simple main effects model finds that only PA is statistically significant, and an anova comparison between a 5-variable main effects model and a 1-variable main effects model finds no difference between the models. So it is possible to simplify the model to just G1L1WR ~ PA. This leaves me with a residual standard error of 0.3026 on 35 degrees of freedom and an adjusted R2 of 0.552.
> 	I also decided, following Crawley?s (2002) advice, to create a maximal model, G1L1WR ~ NR*NS*PA*KINDERWR*WM. This full model is not a good fit, but a stepAIC through the model revealed the model which had a maximal fit:
> 
> maximal.fit=lm(formula = G1L1WR ~ NR + KINDERWR + NS + WM + PA + NR:KINDERWR + NR:NS + KINDERWR:NS + NR:WM + KINDERWR:WM + NS:WM + NR:PA + + KINDERWR:PA + NS:PA + WM:PA + NR:KINDERWR:NS + NR:KINDERWR:WM + NR:NS:WM + KINDERWR:NS:WM + NR:NS:PA + KINDERWR:NS:PA + KINDERWR:WM:PA + NR:KINDERWR:NS:WM, data = lafrance.NoNA)
> 
> All of the terms of this model have statistical t-tests, the residual standard error has gone down to 0.2102, and the adjusted R2 has increased to 0.7839. An anova shows a clear difference between the simplified model and the maximal fit model. My question is, should I really pick the maximal fit over the simple model when it is really so much harder to understand? I guess there?s really no easy answer to that, but if that?s so, then my question is?would there be anything wrong with me saying that sometimes you might value parsimony and ease of understanding over best fit? Because I don?t really know what the maximal fit model buys you. It seems unintelligible to me. All of the terms are involved in interactions to some extent, but there are 4-way interactions and 3-way interactions and 2-way interactions and I?m not sure even how to understand it. A nice tree model showed that at higher levels of PA, KINDERWR and NS affected scores. That I can understand, but that is not reflected in this model.
> 
> 	An auxiliary question, probably easier to answer, is how could I do hierarchical linear regression? The authors knew that PA would be the largest contributor to the response variable because of previous research, and their research question was whether PA would contribute anything AFTER the other 4 variables had already eaten their piece of the response variable pie. I know how to do a hierarchical regression in SPSS, and want to show in parallel how to do this in R. I did search R-help archives and didn?t find quite anything that would just plain tell me how to do hierarchical linear regression.
> 
> Thanks in advance for any help.
> 
> Dr. Jenifer Larson-Hall
> Assistant Professor of Linguistics
> University of North Texas
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Andrew Robinson  
Department of Mathematics and Statistics            Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia         Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/