[R] Comparing models in multiple regression and hierarchical linear regression

Tue Nov 7 04:05:02 CET 2006

I don’t know if this question properly belongs on this list, but I’ll ask it here because I’ve been using R to run linear regression models, and it is only in using R (after switching from SPSS) that I have discovered the process of fitting a linear model. However, after reading Crawley (2002), Fox (2002), Verzani (2004), and Dalgaard (2002), and of course searching the R-help archives, I cannot find an answer to my question.
	I have 5 explanatory variables (NR, NS, PA, KINDERWR, WM) and one response variable (G1L1WR). A simple main effects model finds that only PA is statistically significant, and an anova comparison between the 5-variable main effects model and a 1-variable main effects model finds no statistically significant difference between the two. So it is possible to simplify the model to just G1L1WR ~ PA. This leaves me with a residual standard error of 0.3026 on 35 degrees of freedom and an adjusted R-squared of 0.552.
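
For concreteness, the comparison described above can be sketched roughly as follows. This is only an illustration: the data frame `dat` is simulated as a stand-in for the real data, with n = 37 chosen so that the one-predictor model has 35 residual degrees of freedom, and the object names `full` and `small` are made up for the example.

```r
# Simulated stand-in for the real data (assumption: values are invented)
set.seed(1)
n <- 37
dat <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
dat$G1L1WR <- 0.5 * dat$PA + rnorm(n, sd = 0.3)

# Five-variable main effects model and the one-variable simplification
full  <- lm(G1L1WR ~ NR + NS + PA + KINDERWR + WM, data = dat)
small <- lm(G1L1WR ~ PA, data = dat)

summary(full)                  # t-test for each main effect
anova(small, full)             # F-test: do the four dropped terms improve the fit?
summary(small)$sigma           # residual standard error of the reduced model
summary(small)$adj.r.squared   # adjusted R-squared of the reduced model
```
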
	I also decided, following Crawley’s (2002) advice, to create a maximal model, G1L1WR ~ NR*NS*PA*KINDERWR*WM. This full model is not a good fit, but running stepAIC on it returned the following model, which I will call the maximal fit:

maximal.fit = lm(formula = G1L1WR ~ NR + KINDERWR + NS + WM + PA +
    NR:KINDERWR + NR:NS + KINDERWR:NS + NR:WM + KINDERWR:WM + NS:WM +
    NR:PA + KINDERWR:PA + NS:PA + WM:PA +
    NR:KINDERWR:NS + NR:KINDERWR:WM + NR:NS:WM + KINDERWR:NS:WM +
    NR:NS:PA + KINDERWR:NS:PA + KINDERWR:WM:PA +
    NR:KINDERWR:NS:WM,
    data = lafrance.NoNA)
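
The selection step that leads to a model like this one might look something like the sketch below. It assumes the MASS package (which provides stepAIC) and uses a simulated data frame `dat` in place of lafrance.NoNA, so the terms it retains will not match the model above; `maximal` and `reduced` are illustrative names.

```r
library(MASS)  # provides stepAIC(); MASS ships with standard R installations

# Simulated stand-in for lafrance.NoNA (assumption: values are invented)
set.seed(1)
n <- 37
dat <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
dat$G1L1WR <- 0.5 * dat$PA + 0.3 * dat$PA * dat$NS + rnorm(n, sd = 0.3)

# Maximal model: all main effects and every interaction up to 5-way
maximal <- lm(G1L1WR ~ NR * NS * PA * KINDERWR * WM, data = dat)

# With no scope given, stepAIC searches backward from the maximal model;
# trace = FALSE suppresses the step-by-step log
reduced <- stepAIC(maximal, trace = FALSE)

formula(reduced)        # terms retained after selection
AIC(maximal, reduced)   # AIC of the starting and selected models
```

With 32 coefficients and only 37 observations, the maximal model in this sketch leaves 5 residual degrees of freedom, which is worth keeping in mind when reading its fit statistics.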

All of the terms of this model have statistically significant t-tests, the residual standard error has gone down to 0.2102, and the adjusted R-squared has increased to 0.7839. An anova shows a clear difference between the simplified model and the maximal fit model. My question is: should I really pick the maximal fit over the simple model when it is so much harder to understand? I guess there’s no easy answer to that, but if that’s so, then my question becomes: would there be anything wrong with me saying that sometimes you might value parsimony and ease of understanding over best fit? I don’t really know what the maximal fit model buys you. It seems unintelligible to me. All of the terms are involved in interactions to some extent, but there are 4-way, 3-way, and 2-way interactions, and I’m not sure how even to begin to understand them. A nice tree model showed that at higher levels of PA, KINDERWR and NS affected scores. That I can understand, but it is not reflected in this model.

	An auxiliary question, probably easier to answer, is how I could do hierarchical linear regression. The authors knew from previous research that PA would be the largest contributor to the response variable, and their research question was whether PA would contribute anything AFTER the other 4 variables had already eaten their piece of the response variable pie. I know how to do a hierarchical regression in SPSS, and want to show in parallel how to do this in R. I did search the R-help archives and didn’t find anything that plainly explains how to do hierarchical linear regression.
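
One possible reading of "hierarchical regression" in the SPSS sense is entering predictors in blocks and testing the change in fit at each step. Under that assumption, a minimal sketch would fit the four other variables first, add PA last, and compare the two nested models. The data frame `dat` is simulated and the names `step1`, `step2`, and `r2_change` are illustrative only.

```r
# Simulated stand-in for the real data (assumption: values are invented)
set.seed(1)
n <- 37
dat <- data.frame(NR = rnorm(n), NS = rnorm(n), PA = rnorm(n),
                  KINDERWR = rnorm(n), WM = rnorm(n))
dat$G1L1WR <- 0.5 * dat$PA + 0.2 * dat$NS + rnorm(n, sd = 0.3)

# Block 1: the four variables entered before PA
step1 <- lm(G1L1WR ~ NR + NS + KINDERWR + WM, data = dat)

# Block 2: the same model with PA added last
step2 <- update(step1, . ~ . + PA)

# F-test for what PA contributes after the other four variables
anova(step1, step2)

# Change in R-squared between the two blocks
r2_change <- summary(step2)$r.squared - summary(step1)$r.squared
r2_change
```

Because the models are nested, the F-test from anova() and the R-squared change both describe the contribution of PA over and above the other four predictors.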

Thanks in advance for any help.

Dr. Jenifer Larson-Hall
Assistant Professor of Linguistics
University of North Texas