[R] generalized linear model (glm) and "stepAIC"

Sat Jul 11 02:45:50 CEST 2009

  If you possibly can, you should get some local statistical advice.
There are a number of pitfalls involved in what you're doing.
Frank Harrell's book is a good reference, but may be too advanced
if you are a beginner.  You are right on the edge of having too little
data for what you want to do (the rule of thumb is that you should
have at least 10-20 responses per predictor), and stepwise regression is
known
to inflate p-values (see e.g. Whittingham et al J. Animal Ecology 2006).
That said, here are some specific responses:

Simone Santoro wrote:
> 
> 
> I have 12 response variables (species growth rates) and two
> environmental factors that I want to test to find out a possible
> relation.
> 
> The sample size is quite small: (7<n<12, depending on each species-case).
> 

> I performed a Shapiro test (shapiro.test) to test for normal
> distribution of the responses variables and they were normally
> distribuited 10 times (over 12 possible, i.e. 12 response variables).
> 

  The Shapiro test is probably not very powerful for such a small
data set -- i.e., the data could be non-normal (in fact it almost
certainly *is* non-normal) but the deviation
is not detectable ...  where do your growth rates come from?
Can you make a guess at their probable distribution?

> I performed a Generalized Linear Model in R-software (MASS package),
> and I selected models by automatic backward stepwise (stepAIC
> procedure) considering as the starting model the one with the additive
> effects of both the factors. This is the case for six species growth
> rates (six cases) but for the others six I tested the effect of just
> one factor ("x2", see below) using just the "glm" procedure.
> 

If you're assuming normality you don't need glm(), just lm(),
although glm() will give you the same answer less efficiently
(a common confusion, probably due in part to SAS's PROC GLM,
which is about the same as lm()).

Why different procedures for different cases?

> So, my object containing the data is called "data" and, this is the editor
> for the first species (sp1):
> 
> GLM1<-glm(growth.sp1~x1+x2,family=gaussian, data)
> 
> MOD.SELECTION<-stepAIC(GLM1, trace=TRUE) 
> 
> summary(MOD.SELECTION)
> 
> 
> Here I attach an example of one of these analyses and after I finally
> give you my questions (I hope not to be too long-winded!!):
> 
> 
> 
>> sp1.starting.model<-glm(sp1~x1+x2,family=gaussian, data)
> 
>> sp1.back<-stepAIC(sp1.starting.model, trace=TRUE)
> 
> Start:  AIC=63.6
> 
> sp1 ~ x1 + x2
> 
>        Df Deviance     AIC
> 
> - x2     1   73.490  61.801
> 
> <none>      72.278  63.602
> 
> - x1    1  122.659  67.949
> 
> 
> 
> Step:  AIC=61.8
> 
> gpf ~ x1
> 
> 
> 
>        Df Deviance     AIC
> 
> <none>      73.490  61.801
> 
> - x1    1  126.400  66.309
> 
>> summary(sp1.back)
> 
> 
> 
> Call:
> 
> glm(formula = sp1 ~ x1, family = gaussian, data = data)
> 
> 
> 
> Deviance Residuals: 
> 
>      Min        1Q    Median        3Q       Max  
> 
> -5.04833  -1.15233  -0.06802   0.81325   5.11464  
> 
> 
> 
> Coefficients:
> 
>             Estimate Std. Error t value Pr(>|t|)  
> 
> (Intercept) -7.62399    3.11127  -2.450   0.0342 *
> 
> x1           0.20595    0.07675   2.683   0.0230 *
> 
> ---
> 
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 
> 
> 
> (Dispersion parameter for gaussian family taken to be 7.348965)
> 
> 
> 
>     Null deviance: 126.40  on 11  degrees of freedom
> 
> Residual deviance:  73.49  on 10  degrees of freedom
> 
>   (1 observation deleted due to missingness)
> 
> AIC: 61.801
> 
> 
> 
> Number of Fisher Scoring iterations: 2
> 
> 
> 

 You would probably be better off just doing summary() and
looking at the p-values of the two predictors (if you must ...)

Why are you using AIC if you're interested in testing
relationships rather than prediction?

> THE QUESTIONS:
> 
> 1) Can I trust in the existence of such statistical relation? I mean: is
> there a way to know the power of this test in R?
> 

There are power tests in R, but I don't know if there are any specifically
for this case (two-predictor regression).  Remember that power applies to
the probability of type II (falsely failing to reject null hypothesis)
errors.

> 2) I decided to use always "family=gaussian" because I have also
> negative values in my response variable and I cannot manage it in a
> different way. In fact I was not able to use as link function a
> "negative binomial" as I previously did in SAS because of negative
> values of response variable (as R "told" me when I tried)
> 

Is this a question?  As above, glm() with gaussian family and
identity (default) link is equivalent to lm().

> 3) How should I interpret the dispersion value R give me (in the case
> reported it was "7.3")? I mean, what range of values (if it does exist)
> I would expect to make the result reliable in the case of
> "family=gaussian" (I'm not interested in prediction but just in finding
> a statistical relation)?
> 

In the case of the Gaussian, the dispersion is just the variance.
If you're worried about overdispersion, that's really just a problem
with models (binomial, Poisson) that assume a fixed relationship
between mean and variance.

  If you really were going to use AIC, you should strongly
consider AICc -- corrected for small sample sizes ...

  Ben Bolker

-- 
View this message in context: http://www.nabble.com/generalized-linear-model-%28glm%29-and-%22stepAIC%22-tp24433331p24436223.html
Sent from the R help mailing list archive at Nabble.com.