[R] Collinearity? Cannot get logisticRidge{ridge} to work

Kengo Inagaki kengoing.gj at gmail.com
Thu May 28 22:19:08 CEST 2015


Dr. Dalgaard,

Thank you for further clarifying the problem.
I found a few possible solutions on the internet and will try them out.
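
One fix that came up (an untested sketch on my part, not something confirmed
on this list): logisticRidge() appears to expect a numeric 0/1 response
rather than a factor, which would explain the "storage mode of a factor"
error, since glm() happily accepts a two-level factor response.

library(ridge)
## Recode the two-level factor as numeric 0/1 before calling logisticRidge()
a$Out01 <- as.numeric(a$Outcome == "Death")             # 1 = Death, 0 = Alive
rfit <- logisticRidge(Out01 ~ Sex + Therapy1, data = a)
summary(rfit)

Penalized-likelihood fits such as Firth's bias reduction (e.g. the logistf
package) also came up repeatedly as a remedy for the separation itself.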

This was my first time posting questions on this mailing list, and I
learned quite a bit through working on this problem.
I apologize for any impoliteness you may have noticed.

Best regards,

Kengo


2015-05-28 4:26 GMT-05:00 peter dalgaard <pdalgd at gmail.com>:
>
> On 28 May 2015, at 00:06 , Kengo Inagaki <kengoing.gj at gmail.com> wrote:
>
>> I did not understand complete separation very well.
>> Thank you very much for the clarification.
>>
>> Kengo
>>
>> 2015-05-27 17:03 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>>>
>>> On May 27, 2015, at 3:00 PM, Kengo Inagaki wrote:
>>>
>>>> Here is the result-
>>>>
>>>>> with(a,  table(Sex, Therapy1,  Outcome) )
>>>> , , Outcome = Alive
>>>>
>>>>       Therapy1
>>>> Sex      no yes
>>>> female  0   4
>>>> male    4   5
>>>>
>>>> , , Outcome = Death
>>>>
>>>>       Therapy1
>>>> Sex      no yes
>>>> female  6   3
>>>> male    3   0
>>>
>>> So no survivors when females had no Therapy1, and no deaths in the
>>> opposite corner (males with Therapy1). Complete separation.
>
>
> Actually not quite complete separation, but just as bad.  If you look at the linear combination Sex + Therapy, you get
>
> 0 (female, no therapy)
> 1 (female, therapy OR male, no therapy)
> 2 (male, therapy)
>
>
> 0: 6 dead, 0 survive
> 1: 6 dead, 8 survive
> 2: 0 dead, 5 survive
>
> and any logistic curve whose logit at x = 1 equals log(6/8) fits the middle
> point exactly, and the other two points are fitted better and better as the
> curve gets steeper, so the fit diverges.
>
> That's a general pattern: you can have complete separation except at one
> point and still get divergence. The same thing happens (and it is really
> just the same phenomenon) in multiple regression with k parameters when
> there is a (k-1)-dimensional hyperplane in predictor space with all
> responses 0 on one side and all 1 on the other, but possibly both 0 and 1
> _on_ the hyperplane. Google tells me that this is called quasicomplete
> separation.
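>
> For concreteness, here is one way to rebuild the 25 observations from the
> counts in the tables above and watch the fit diverge (an untested sketch):
>
> ## Counts per (Sex, Therapy1) cell taken from the table() output above
> dd <- data.frame(
>     Sex      = rep(c("female", "female", "male", "male"), c(6, 7, 7, 5)),
>     Therapy1 = rep(c("no", "yes", "no", "yes"),           c(6, 7, 7, 5)),
>     Outcome  = c(rep("Death", 6),                    # female, no:  6 dead
>                  rep(c("Death", "Alive"), c(3, 4)),  # female, yes: 3 dead, 4 alive
>                  rep(c("Death", "Alive"), c(3, 4)),  # male,   no:  3 dead, 4 alive
>                  rep("Alive", 5)))                   # male,   yes: 5 alive
> fit <- glm(Outcome ~ Sex + Therapy1, data = dd, family = binomial)
> summary(fit)
> ## Expect huge coefficients and standard errors, and typically the warning
> ## "fitted probabilities numerically 0 or 1 occurred": the fit diverges.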
>
> -pd
>
>>>
>>> --
>>> David.
>>>
>>>>
>>>>
>>>> 2015-05-27 16:57 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>>>>>
>>>>> On May 27, 2015, at 2:49 PM, Kengo Inagaki wrote:
>>>>>
>>>>>> Thank you very much for your rapid response. I sincerely appreciate your input.
>>>>>> I am sorry for sending the previous email in HTML format.
>>>>>>
>>>>>> with(a,  table(Sex, Therapy1) )   shows the following.
>>>>>>        Therapy1
>>>>>> Sex      no yes
>>>>>> female  6   7
>>>>>> male    7   5
>>>>>>
>>>>>> and with(a, table(Sex, Outcome)) and with(a, table(Therapy1, Outcome))
>>>>>> give the following:
>>>>>>
>>>>>>      Outcome
>>>>>> Sex      Alive Death
>>>>>> female     4     9
>>>>>> male       9     3
>>>>>>
>>>>>>      Outcome
>>>>>> Therapy1 Alive Death
>>>>>>   no      4     9
>>>>>>   yes     9     3
>>>>>
>>>>> Then what about:
>>>>>
>>>>> with(a,  table(Sex, Therapy1,  Outcome) )
>>>>>
>>>>> --
>>>>> David
>>>>>
>>>>>
>>>>>>
>>>>>> As there are no zero cells, this does not seem to be complete separation.
>>>>>> I really appreciate your comments.
>>>>>>
>>>>>> Kengo Inagaki
>>>>>> Memphis, TN
>>>>>>
>>>>>>
>>>>>> 2015-05-27 13:57 GMT-05:00 David Winsemius <dwinsemius at comcast.net>:
>>>>>>>
>>>>>>> On May 27, 2015, at 10:10 AM, Kengo Inagaki wrote:
>>>>>>>
>>>>>>>> I am currently working on a health care related project using R. I am
>>>>>>>> learning R while working on data analysis.
>>>>>>>>
>>>>>>>> Below is the part of the data where I am encountering a problem.
>>>>>>>>
>>>>>>>>
>>>>>>>> Case#  Sex   Therapy1  Therapy2  Outcome
>>>>>>>>
>>>>>>>> 1      male  no        no        Alive
>>>>>>>>
>>>>>>>
>>>>>>> snipped mangled data sent in HTML
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> "Outcome" is the response variable and "Sex", "Therapy1", "Therapy2" are
>>>>>>>> predictor variables.
>>>>>>>>
>>>>>>>> All of the predictors are significantly associated with the outcome by
>>>>>>>> univariate analysis.
>>>>>>>>
>>>>>>>> Logistic regression runs fine with most of the predictors when "Sex" and
>>>>>>>> "Therapy1" are not included at the same time (this is part of a larger
>>>>>>>> table that I cut down for ease of presentation, and there are more
>>>>>>>> predictors that I tested).
>>>>>>>
>>>>>>> Please examine the data before reaching for ridge regression:
>>>>>>>
>>>>>>> What does this show: ...
>>>>>>>
>>>>>>>  with(a,  table(Sex, Therapy1) )
>>>>>>>
>>>>>>> I predict you will see a zero cell entry. Then read about "complete separation" and the so-called "Hauck-Donner effect".
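>>>>>>>
>>>>>>> One quick symptom check (a sketch, assuming the Model object from the
>>>>>>> glm() call quoted below): the likelihood-ratio tests remain clearly
>>>>>>> significant while the Wald statistics collapse, which is the
>>>>>>> Hauck-Donner pattern.
>>>>>>>
>>>>>>> anova(Model, test = "Chisq")  # sequential LR tests: still significant
>>>>>>> summary(Model)                # Wald tests: inflated SEs, p-values near 1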
>>>>>>>
>>>>>>> --
>>>>>>> David.
>>>>>>>>
>>>>>>>> However, when "Sex" and "Therapy1" are included in the logistic regression
>>>>>>>> model at the same time, the standard errors inflate and the p-values get
>>>>>>>> close to 1.
>>>>>>>>
>>>>>>>> The formula used is,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Model <- glm(Outcome ~ Sex + Therapy1, data = a, family = binomial)  # "a" is the data frame holding the table above
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> After doing some reading, I suspect this might be collinearity, as the VIF
>>>>>>>> values (from the vif() function in the car package) were sky-high
>>>>>>>> (8,875,841 for both "Sex" and "Therapy1").
>>>>>>>>
>>>>>>>> Learning that ridge regression may be a solution, I attempted
>>>>>>>> logisticRidge {ridge} using the following formula, but I get the
>>>>>>>> accompanying error message.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> logisticRidge(a$Outcome~a$Sex+a$Therapy1)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Error in ifelse(y, log(p), log(1 - p)) :
>>>>>>>>   invalid to change the storage mode of a factor
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> At this point I have no idea how to solve this and would like to seek
>>>>>>>> help.
>>>>>>>>
>>>>>>>> I really really appreciate your input!!!
>>>>>>>>
>>>>>>>>    [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> David Winsemius
>>>>>>> Alameda, CA, USA
>>>>>>>
>>>>>
>>>>> David Winsemius
>>>>> Alameda, CA, USA
>>>>>
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


