[R] logistic regression model specification
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Wed Nov 14 14:17:22 CET 2007
Dylan Beaudette wrote:
> On Tuesday 13 November 2007, Peter Dalgaard wrote:
>
>> Prof Brian Ripley wrote:
>>
>>> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
>>>
>>>> Hi,
>>>>
>>>> I have setup a simple logistic regression model with the glm() function,
>>>> with the follow formula:
>>>>
>>>> y ~ a + b
>>>>
>>>> where:
>>>> 'a' is a continuous variable stratified by
>>>> the levels of 'b'
>>>>
>>>>
>>>> Looking over the manual for model specification, it seems that
>>>> coefficients for unordered factors are given 'against' the first level
>>>> of that factor.
>>>>
>>> Only for the default coding.
>>>
>>>
>>>> This makes for difficult interpretation when using factor 'b' as a
>>>> stratifying model term.
>>>>
>>> Really? You realize that you have not 'stratified' on 'b', which would
>>> need the model to be a*b? What you have is a model with parallel linear
>>> predictors for different levels of 'b', and if the coefficients are not
>>> telling you what you want you should change the coding.
>>>
>
> Thanks for the comments Peter. Note that any use/abuse of jargon on my part
> stems from my limited understanding.
>
>
>> I have to differ slightly here. "Stratification", at least in the fields
>> that I connect with, usually means that you combine information from
>> several groups via an assumption that they have a common value of a
>> parameter, which in the present case is essentially the same as assuming
>> an additive model y~a+b.
>>
>
> This was the definition of 'stratification' that I was using when formulating
> this model.
>
>
>> I share your confusion as to why the parametrization of the effects of
>> factor b should matter, though. Surely, the original poster has already
>> noticed that the estimated effect of a is the same whether or not the
>> intercept is included? The only difference I see is that running
>> anova() or drop1() would not give interesting results for the effect of
>> b in the no-intercept variation.
>>
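(That last point is easy to demonstrate; a minimal sketch with the whiteside data used later in this thread:)

```r
library(MASS)  # provides the whiteside data set, also used below

# no-intercept parametrization: one intercept per level of Insul
fit <- lm(Gas ~ Temp + Insul - 1, data = whiteside)
drop1(fit, test = "F")
# Dropping Insul here compares against Gas ~ Temp - 1, i.e. it tests
# whether BOTH group intercepts are zero -- not an interesting hypothesis.
```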
>
> Yes, I have noticed that the estimated effect is the same. It looks like I am
> having trouble interpreting the model formula and coefficients. An example
> from MASS, using the default R contrasts:
>
> library(MASS)
> data(whiteside)
>
> # simple additive model, relative to first level of 'Insul'
> # 'parallel regressions'
> summary(lm(Gas ~ Temp + Insul, data=whiteside))
>
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 6.55133 0.11809 55.48 <2e-16 ***
> Temp -0.33670 0.01776 -18.95 <2e-16 ***
> InsulAfter -1.56520 0.09705 -16.13 <2e-16 ***
>
>
> # same model without the intercept
> summary(lm(Gas ~ Temp + Insul -1 , data=whiteside))
>
> Estimate Std. Error t value Pr(>|t|)
> Temp -0.33670 0.01776 -18.95 <2e-16 ***
> InsulBefore 6.55133 0.11809 55.48 <2e-16 ***
> InsulAfter 4.98612 0.10268 48.56 <2e-16 ***
>
> In presenting the terms of this model, I would like to be able to talk about
> the physical meaning of the coefficients. Minus the intercept term, the
> second example presents the group intercepts before and after insulation. I
> now understand how to interpret the first example, as:
>
> Intercept + InsulAfter = intercept for the 'After' group:
> 6.55133 + (-1.56520) = 4.98612
>
> ... i.e. the coefficient for InsulAfter is relative to the
> Intercept (treatment-style contrasts).
>
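(One can check this equivalence directly; a minimal sketch using the whiteside data from the example above:)

```r
library(MASS)  # provides the whiteside data set

fit1 <- lm(Gas ~ Temp + Insul, data = whiteside)      # treatment contrasts
fit2 <- lm(Gas ~ Temp + Insul - 1, data = whiteside)  # per-group intercepts

# The 'After' intercept in fit2 equals Intercept + InsulAfter from fit1:
coef(fit1)["(Intercept)"] + coef(fit1)["InsulAfter"]
coef(fit2)["InsulAfter"]

# The fitted values are identical; only the parametrization differs:
all.equal(fitted(fit1), fitted(fit2))  # TRUE
```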
> With the above example the significance terms do not really change when the
> intercept is removed. Within the context of my data, removing the intercept
> has the following effect:
>
> # with intercept
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 7.296e+00 1.404e+00 5.196 2.03e-07 ***
> bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
> clastic_volcanic -1.396e-01 8.209e-01 -0.170 0.86495
> coarse_sedimentary -2.720e+00 8.634e-01 -3.150 0.00163 **
> felsic_intrusive -3.862e-01 8.404e-01 -0.460 0.64583
> fine_sedimentary -1.010e+00 8.795e-01 -1.149 0.25069
> rhyolite -8.420e-01 8.531e-01 -0.987 0.32365
> tuff 1.569e+01 1.008e+03 0.016 0.98758
>
> # without intercept:
> Estimate Std. Error z value Pr(>|z|)
> bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
> andesite 7.296e+00 1.404e+00 5.196 2.03e-07 ***
> clastic_volcanic 7.156e+00 1.276e+00 5.608 2.05e-08 ***
> coarse_sedimentary 4.576e+00 1.158e+00 3.951 7.77e-05 ***
> felsic_intrusive 6.910e+00 1.252e+00 5.520 3.39e-08 ***
> fine_sedimentary 6.286e+00 1.279e+00 4.916 8.83e-07 ***
> rhyolite 6.454e+00 1.289e+00 5.005 5.57e-07 ***
> tuff 2.299e+01 1.008e+03 0.023 0.982
>
>
> ... the meaning of the coefficients now makes sense (I think!); however, the
> significance of each level of the 'stratifying' factor has changed
> considerably. How can I interpret that change?
>
>
>
(This is not raw R output, is it? The name of the stratifying factor
seems to be missing.)
If you add the Intercept to each term in the first analysis (and to zero
for andesite), you get the numbers in the second analysis; conversely,
subtract the andesite coefficient from each term in the 2nd analysis and
you get those in the first.
In the first display, you can immediately see that (the intercept of)
the line for coarse_sedimentary is significantly lower than that for
andesite. Since the regression coefficient on bd_sum is assumed the
same for all strata, the difference in intercepts is actually the
difference for any fixed value of bd_sum. The other factor levels are
not significantly different from andesite.
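If a different reference level is wanted, relevel() changes the base against which the treatment contrasts are formed. A sketch, assuming the data frame is called 'dat' and the stratifying factor 'geol' (its real name is missing from the output above):

```r
# Hypothetical names: 'dat' is the data frame, 'geol' the stratifying
# factor, 'y' the binary response -- none of these are shown above.
dat$geol <- relevel(dat$geol, ref = "coarse_sedimentary")
fit2 <- glm(y ~ bd_sum + geol, family = binomial, data = dat)
summary(fit2)  # each level is now contrasted with coarse_sedimentary
```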
In the second display, the significance tests are for whether the
intercept for each per-stratum line can be zero. I.e. whether the value
of the linear predictor is zero when bd_sum is zero, which in turn,
since this is a logistic model, implies that the predicted probability
is 0.5. This is probably not a relevant hypothesis at all, but all
intercepts except the last one are significantly non-zero. "Tuff"
appears to be anomalous; it has a large coefficient and a huge s.e. --
presumably the response is all 1's in this group?
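(The "tuff" symptom is easy to reproduce; a minimal simulated example of quasi-complete separation, with made-up data:)

```r
set.seed(1)
g <- factor(rep(c("A", "B"), each = 20))
y <- c(rbinom(20, 1, 0.5), rep(1, 20))  # group B is all 1's
fit <- glm(y ~ g, family = binomial)
summary(fit)
# gB has a large estimate and a huge standard error, because the MLE
# for that group's log-odds is +Inf; glm() also warns that fitted
# probabilities numerically 0 or 1 occurred.
```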
> Thanks,
>
> Dylan
>
>
>
>> -p
>>
>>
>>> Much of what I am trying to get across is that you have a lot of choice
>>> as to how you specify a model to R. There has to be a default, which is
>>> chosen because it is often a good choice. It does rely on factors being
>>> coded well: the 'base level' (to quote ?contr.treatment) needs to be
>>> interpretable. And also bear in mind that the default choices of
>>> statistical software in this area almost all differ (and R's differs from
>>> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
>>> choice' do differ.
>>>
>>>
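(As an illustration of the coding choices mentioned above, a sketch with the whiteside data used elsewhere in this thread:)

```r
library(MASS)
# Default: treatment contrasts against the first level ('Before')
lm(Gas ~ Temp + Insul, data = whiteside)
# Sum-to-zero coding instead: Insul1 is the deviation of 'Before'
# from the average of the two group intercepts
lm(Gas ~ Temp + Insul, data = whiteside,
   contrasts = list(Insul = "contr.sum"))
```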
>>>> Setting up the model, minus the intercept term, gives me what appear to
>>>> be more meaningful coefficients. However, I am not sure whether I am
>>>> correctly interpreting the results from a linear model without an
>>>> intercept term. Model predictions from both specifications (with and
>>>> without an intercept term) are nearly identical (differing by about
>>>> 1e-16 in probability space).
>>>>
>>>> Are there any gotchas to look out for when removing the intercept term
>>>> from such a model?
>>>>
>>> It is just a different parametrization of the linear predictor.
>>> Anything interpretable in terms of the predictions of the model will be
>>> unchanged. That is the crux: the default coefficients of 'b' will be
>>> log odds-ratios that are directly interpretable, and those in the
>>> per-group coding will be log-odds for a zero value of 'a'. Does a zero
>>> value of 'a' make sense?
>>>
>>>
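(If a zero value of 'a' is not meaningful, centering it makes the per-group intercepts interpretable; a sketch with the whiteside data:)

```r
library(MASS)
# Center Temp so that the group intercepts are the expected Gas
# consumption at the AVERAGE temperature rather than at 0 degrees
whiteside$cTemp <- whiteside$Temp - mean(whiteside$Temp)
lm(Gas ~ cTemp + Insul - 1, data = whiteside)
```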
>>>> Any guidance would be greatly appreciated.
>>>>
>>>> Cheers,
>>>>
>
>
>
>
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907