[R] interpretation of p values for highly correlated logistic analysis
Claus O'Rourke
claus.orourke at gmail.com
Thu Apr 1 10:03:59 CEST 2010
Thank you both for your advice. I'll follow up on it, but it is good
to know that this is a known effect.
Claus
On Wed, Mar 31, 2010 at 3:02 PM, Stephan Kolassa <Stephan.Kolassa at gmx.de> wrote:
> Hi Claus,
>
> welcome to the wonderful world of collinearity (or multicollinearity, as
> some call it)! You have a near-linear relationship between some of your
> predictors, which can (and in your case does) lead to extreme parameter
> estimates. In some cases these almost cancel out (a coefficient of +/-40
> on a categorical variable in logistic regression is enormous, and the
> intercept and two of the roman parameter estimates nearly cancel), but
> they are very unstable, with huge standard errors (hence your high
> p-values).
>
> Belsley, Kuh and Welsch did some work on condition indices and variance
> decomposition proportions, and variance inflation factors are quite popular
> for diagnosing multicollinearity - google these terms for a bit, and
> enlightenment will surely follow.
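The diagnostics mentioned above can be sketched in base R; the simulated data and variable names below are illustrative, not from the thread (the `car` package's `vif()` is the convenient packaged alternative):

```r
## A base-R sketch of two common collinearity diagnostics, on simulated
## data with two nearly collinear predictors (all names here are illustrative).
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)     # x2 is almost a linear copy of x1
y  <- rbinom(100, 1, plogis(x1))
fit <- glm(y ~ x1 + x2, family = binomial)

## Variance inflation factor for x2: 1 / (1 - R^2) from regressing it on
## the other predictor(s); values far above ~10 flag serious collinearity.
vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)
vif_x2

## Condition number of the model matrix; very large values (say > 30)
## are another classic warning sign (Belsley, Kuh and Welsch).
kappa(model.matrix(fit), exact = TRUE)
```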
>
> What can you do? You should definitely think long and hard about your data.
> Should you be doing separate regressions for some factor levels? Should you
> drop a factor from the analysis? Should you do a categorical analogue of
> Principal Components Analysis on your data before the regression? I
> personally have never done this, but correspondence analysis has been
> recommended as a "discrete alternative" to PCA on this list, see a couple of
> books by M. J. Greenacre.
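A minimal sketch of correspondence analysis, using the recommended `MASS` package (shipped with most R installations) and its built-in `caith` hair/eye colour table rather than the poster's data:

```r
## Correspondence analysis of a two-way table of categorical counts,
## shown on MASS's built-in 'caith' eye/hair colour data.
library(MASS)
ca <- corresp(caith, nf = 2)  # two-dimensional solution
ca$cor                         # canonical correlations per dimension
biplot(ca)                     # joint display of row and column categories
```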
>
> Best of luck!
> Stephan
>
>
> claus orourke wrote:
>>
>> Dear list,
>>
>> I want to perform a logistic regression analysis with multiple
>> categorical predictors (i.e., a logit) on some data where there is a
>> very definite relationship between one predictor and the
>> response/dependent variable. The problem I have is that in such a
>> case the p value comes out very high (while I, as a naive newbie,
>> would expect it to crash towards 0).
>>
>> I'll illustrate my problem with some toy data. Say I have the
>> following data as an input frame:
>>
>> roman animal colour
>> 1 alpha dog black
>> 2 beta cat white
>> 3 alpha dog black
>> 4 alpha cat black
>> 5 beta dog white
>> 6 alpha cat black
>> 7 gamma dog white
>> 8 alpha cat black
>> 9 gamma dog white
>> 10 beta cat white
>> 11 alpha dog black
>> 12 alpha cat black
>> 13 gamma dog white
>> 14 alpha cat black
>> 15 beta dog white
>> 16 beta cat black
>> 17 alpha cat black
>> 18 beta dog white
>>
>> In this toy data you can see that roman:alpha and roman:beta are
>> pretty good predictors of colour
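The table above can be rebuilt as a reproducible data frame; the cross-tabulation makes the near-perfect association visible:

```r
## The toy data from the post, rebuilt as a data frame.
d <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  colour = c("black","white","black","black","white","black","white","black",
             "white","white","black","black","white","black","white","black",
             "black","white")
)
table(d$roman, d$colour)  # alpha rows are all black, gamma rows all white
```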
>>
>> Let's say I perform logistic analysis directly on the raw data with
>> colour as a response variable:
>>
>>> options(contrasts=c("contr.treatment","contr.poly"))
>>> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)
>>
>> then I find that my P values for each individual level coefficient
>> approach 1:
>>
>> Coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) -41.65 19609.49 -0.002 0.998
>> data$romanbeta 42.35 19609.49 0.002 0.998
>> data$romangamma 43.74 31089.48 0.001 0.999
>> data$animaldog 20.48 13866.00 0.001 0.999
>>
>> while I expect the p value for roman:beta to be quite low because it
>> is a good predictor of colour:white
>>
>> On the other hand, if I then run an anova with a Chi-sq test on the
>> result model, I find as I would expect that 'roman' is a good
>> predictor of colour.
>>
>>> anova(anal1,test="Chisq")
>>
>> Analysis of Deviance Table
>>
>> Model: binomial, link: logit
>>
>> Response: data$colour
>>
>> Terms added sequentially (first to last)
>>
>>
>> Df Deviance Resid. Df Resid. Dev P(>|Chi|)
>> NULL 17 24.7306
>> data$roman 2 19.3239 15 5.4067 6.366e-05 ***
>> data$animal 1 1.5876 14 3.8191 0.2077
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
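One way to reconcile the two outputs: the per-coefficient p-values are Wald tests, which become meaningless when the estimates and standard errors explode under (quasi-)separation (sometimes called the Hauck-Donner effect), while likelihood-ratio tests remain usable. `drop1()` gives per-term LRT p-values without depending on the order in which terms were added; a sketch on the toy data, rebuilt here so the block runs on its own:

```r
## Per-term likelihood-ratio tests via drop1(): these avoid the Wald
## standard errors that blow up under quasi-separation.
d <- data.frame(
  roman  = c("alpha","beta","alpha","alpha","beta","alpha","gamma","alpha","gamma",
             "beta","alpha","alpha","gamma","alpha","beta","beta","alpha","beta"),
  animal = c("dog","cat","dog","cat","dog","cat","dog","cat","dog",
             "cat","dog","cat","dog","cat","dog","cat","cat","dog"),
  colour = factor(c("black","white","black","black","white","black","white","black",
                    "white","white","black","black","white","black","white","black",
                    "black","white"))
)
## glm warns that fitted probabilities of 0 or 1 occurred: that is the
## quasi-separation showing itself.
fit <- glm(colour ~ roman + animal, family = binomial, data = d)
drop1(fit, test = "Chisq")  # LRT p-value for roman is small, as expected
```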
>>
>> Can anyone please explain why my p value is so high for the individual
>> levels?
>>
>> Sorry for what is likely a stupid question.
>>
>> Claus
>>
>> p.s., when I run logistic analysis on data that is more 'randomised'
>> everything comes out as I expect.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>