[R] interpretation of p values for highly correlated logistic analysis
Claus O'Rourke
claus.orourke at gmail.com
Thu Apr 1 10:03:59 CEST 2010
Thank you both for your advice. I'll follow up on it, but it is good
to know that this is a known effect.
Claus
On Wed, Mar 31, 2010 at 3:02 PM, Stephan Kolassa <Stephan.Kolassa at gmx.de> wrote:
> Hi Claus,
>
> welcome to the wonderful world of collinearity (or multicollinearity, as
> some call it)! You have a near-linear relationship between some of your
> predictors, which can (and in your case does) lead to extreme parameter
> estimates. In some cases these almost cancel out (a coefficient of +/-40
> on a categorical variable in logistic regression is enormous, and the
> intercept and two of the roman parameter estimates nearly cancel), but
> they are very unstable, with huge standard errors (hence your high
> p-values).
>
> Belsley, Kuh and Welsch did some work on condition indices and variance
> decomposition proportions, and variance inflation factors are quite popular
> for diagnosing multicollinearity - google these terms for a bit, and
> enlightenment will surely follow.
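The diagnostics mentioned above can be sketched in base R; the simulated data and variable names below are illustrative, not from the thread (the `car` package's `vif()` is the convenient packaged alternative):

```r
## A base-R sketch of two common collinearity diagnostics, on simulated
## data with two nearly collinear predictors (all names here are illustrative).
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)     # x2 is almost a linear copy of x1
y  <- rbinom(100, 1, plogis(x1))
fit <- glm(y ~ x1 + x2, family = binomial)

## Variance inflation factor for x2: 1 / (1 - R^2) from regressing it on
## the other predictor(s); values far above ~10 flag serious collinearity.
vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)
vif_x2

## Condition number of the model matrix; very large values (say > 30)
## are another classic warning sign (Belsley, Kuh and Welsch).
kappa(model.matrix(fit), exact = TRUE)
```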
>
> What can you do? You should definitely think long and hard about your data.
> Should you be doing separate regressions for some factor levels? Should you
> drop a factor from the analysis? Should you do a categorical analogue of
> Principal Components Analysis on your data before the regression? I
> personally have never done this, but correspondence analysis has been
> recommended as a "discrete alternative" to PCA on this list, see a couple of
> books by M. J. Greenacre.
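A minimal sketch of correspondence analysis, using the recommended `MASS` package (shipped with most R installations) and its built-in `caith` hair/eye colour table rather than the poster's data:

```r
## Correspondence analysis of a two-way table of categorical counts,
## shown on MASS's built-in 'caith' eye/hair colour data.
library(MASS)
ca <- corresp(caith, nf = 2)  # two-dimensional solution
ca$cor                         # canonical correlations per dimension
biplot(ca)                     # joint display of row and column categories
```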
>
> Best of luck!
> Stephan
>
>
> claus orourke wrote:
>>
>> Dear list,
>>
>> I want to perform a logistic regression analysis with multiple
>> categorical predictors (i.e., a logit) on some data where there is a
>> very definite relationship between one predictor and the
>> response/dependent variable. The problem I have is that in such a
>> case the p value comes out very high (while I, as a naive newbie,
>> would expect it to crash towards 0).
>>
>> I'll illustrate my problem with some toy data. Say I have the
>> following data as an input frame:
>>
>> roman animal colour
>> 1 alpha dog black
>> 2 beta cat white
>> 3 alpha dog black
>> 4 alpha cat black
>> 5 beta dog white
>> 6 alpha cat black
>> 7 gamma dog white
>> 8 alpha cat black
>> 9 gamma dog white
>> 10 beta cat white
>> 11 alpha dog black
>> 12 alpha cat black
>> 13 gamma dog white
>> 14 alpha cat black
>> 15 beta dog white
>> 16 beta cat black
>> 17 alpha cat black
>> 18 beta dog white
>>
>> In this toy data you can see that roman:alpha and roman:beta are
>> pretty good predictors of colour
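The table above can be rebuilt as a reproducible data frame; the cross-tabulation makes the near-perfect association visible:

```r
## The toy data from the post, rebuilt as a data frame.
d <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  colour = c("black","white","black","black","white","black","white","black",
             "white","white","black","black","white","black","white","black",
             "black","white")
)
table(d$roman, d$colour)  # alpha rows are all black, gamma rows all white
```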
>>
>> Let's say I perform logistic analysis directly on the raw data with
>> colour as a response variable:
>>
>>> options(contrasts=c("contr.treatment","contr.poly"))
>>> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)
>>
>> then I find that my P values for each individual level coefficient
>> approach 1:
>>
>> Coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) -41.65 19609.49 -0.002 0.998
>> data$romanbeta 42.35 19609.49 0.002 0.998
>> data$romangamma 43.74 31089.48 0.001 0.999
>> data$animaldog 20.48 13866.00 0.001 0.999
>>
>> while I expect the p value for roman:beta to be quite low because it
>> is a good predictor of colour:white
>>
>> On the other hand, if I then run an anova with a Chi-sq test on the
>> result model, I find as I would expect that 'roman' is a good
>> predictor of colour.
>>
>>> anova(anal1,test="Chisq")
>>
>> Analysis of Deviance Table
>>
>> Model: binomial, link: logit
>>
>> Response: data$colour
>>
>> Terms added sequentially (first to last)
>>
>>
>> Df Deviance Resid. Df Resid. Dev P(>|Chi|)
>> NULL 17 24.7306
>> data$roman 2 19.3239 15 5.4067 6.366e-05 ***
>> data$animal 1 1.5876 14 3.8191 0.2077
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
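One way to reconcile the two outputs: the per-coefficient p-values are Wald tests, which become meaningless when the estimates and standard errors explode under (quasi-)separation (sometimes called the Hauck-Donner effect), while likelihood-ratio tests remain usable. `drop1()` gives per-term LRT p-values without depending on the order in which terms were added; a sketch on the toy data, rebuilt here so the block runs on its own:

```r
## Per-term likelihood-ratio tests via drop1(): these avoid the Wald
## standard errors that blow up under quasi-separation.
d <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  colour = factor(c("black","white","black","black","white","black","white","black",
                    "white","white","black","black","white","black","white","black",
                    "black","white"))
)
## glm warns that fitted probabilities of 0 or 1 occurred: that is the
## quasi-separation showing itself.
fit <- glm(colour ~ roman + animal, family = binomial, data = d)
drop1(fit, test = "Chisq")  # LRT p-value for roman is small, as expected
```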
>>
>> Can anyone please explain why my p value is so high for the individual
>> levels?
>>
>> Sorry for what is likely a stupid question.
>>
>> Claus
>>
>> p.s., when I run logistic analysis on data that is more 'randomised'
>> everything comes out as I expect.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>