[R] R: fitting in logistic model
Renzi Fabrizio
frenzi at fgo.bcc.it
Wed Jul 1 09:42:51 CEST 2009
Thank you very much for your answer.
Marc hit the point of my query. Now we need somebody who knows how R
computes the fitted values, and why it does not just use the inverse link...
In my humble opinion, R uses a kind of interpolation over some standard
points (with a minimum value of 2.220446e-16), but that is just a
surmise.
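To make the surmise concrete: even a simple floor at the machine epsilon
(rather than any real interpolation) would reproduce the minimum I observe.
The following is purely hypothetical on my part (linkinv_clamped is my own
illustrative name, not anything in R):

## Hypothetical inverse link whose output is floored/capped at the
## machine epsilon; an illustration of the surmise, not R's actual code.
linkinv_clamped <- function(eta) {
    p <- exp(eta) / (1 + exp(eta))
    pmin(pmax(p, .Machine$double.eps), 1 - .Machine$double.eps)
}
linkinv_clamped(-350)         # 2.220446e-16, the floor I see in the fitted values
exp(-350) / (1 + exp(-350))   # of the order of 1e-152, like my "manual" minimum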
I think it would be useful to investigate the glm function itself; I
tried to read its source in the editor, but I could not make sense of it...
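If anyone else wants to try, it may be easier to display the relevant
pieces directly at the prompt than to open the sources in an editor. Just
a sketch of where to start; I have not got far with the output myself:

body(glm)               # the R-level wrapper
body(glm.fit)           # the IWLS workhorse that produces the fitted values
binomial()$linkinv      # the inverse link it applies (compiled C code, as Marc showed)
getAnywhere("glm.fit")  # another way to display the same source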
Fabrizio Renzi
-----Original Message-----
From: Ted.Harding at manchester.ac.uk [mailto:Ted.Harding at manchester.ac.uk]
Sent: 30 June 2009 19:54
To: Marc Schwartz
Cc: r-help at r-project.org; Renzi Fabrizio
Subject: Re: [R] fitting in logistic model
On 30-Jun-09 17:41:20, Marc Schwartz wrote:
> On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:
>
>>
>> On 30-Jun-09 14:52:20, Marc Schwartz wrote:
>>> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
>>>
>>>> I would like to know how R computes the probability of an event
>>>> in a logistic model (P(y=1)) from the score s, a linear combination
>>>> of x and beta.
>>>>
>>>> I noticed that there are small differences (of the order of 1e-16)
>>>> between the fitted values automatically computed by the glm
>>>> procedure in R and the values I computed "manually" by applying the
>>>> inverse formula p = e^s/(1+e^s). Moreover, I noticed that the
>>>> minimum of the fitted values from the glm estimation is
>>>> 2.220446e-16, and there are many observations with this probability
>>>> (whereas the minimum value obtained by the "manual" computation is
>>>> 2.872636e-152).
>>>
>>> It would be helpful to see at least a subset of the output from your
>>> model and your manual computations so that we can at least visually
>>> compare the results to see where the differences may be.
>>>
>>> The model object returned from using glm() will contain both the
>>> linear predictors on the link scale (model$linear.predictors) and
>>> the fitted values (model$fitted.values). The latter can be accessed
>>> using the fitted() extractor function.
>>>
>>> To use an example, let's create a simple LR model using the infert
>>> data set as referenced in ?glm.
>>>
>>> model1 <- glm(case ~ spontaneous + induced, data = infert,
>>> family = binomial())
>>>
>>>> model1
>>> Call: glm(formula = case ~ spontaneous + induced,
>>>     family = binomial(), data = infert)
>>>
>>> Coefficients:
>>> (Intercept) spontaneous induced
>>> -1.7079 1.1972 0.4181
>>>
>>> Degrees of Freedom: 247 Total (i.e. Null); 245 Residual
>>> Null Deviance: 316.2
>>> Residual Deviance: 279.6 AIC: 285.6
>>>
>>> # Get the coefficients
>>>> coef(model1)
>>> (Intercept) spontaneous induced
>>> -1.7078601 1.1972050 0.4181294
>>>
>>> # get the linear predictor values
>>> # log odds scale for binomial glm
>>>> head(model1$linear.predictors)
>>>           1           2           3           4           5           6
>>>  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> You can also get the above by using the coefficients and the model
>>> matrix for comparison:
>>> # the full set of 248
>>> # coef(model1) %*% t(model.matrix(model1))
>>>> head(as.vector(coef(model1) %*% t(model.matrix(model1))))
>>> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375
>>>
>>> # get fitted response values (predicted probs)
>>>> head(fitted(model1))
>>> 1 2 3 4 5 6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> We can also get the fitted values from the linear predictor values
>>> by using:
>>>
>>> LP <- model1$linear.predictors
>>>> head(exp(LP) / (1 + exp(LP)))
>>> 1 2 3 4 5 6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> You can also get the above by using the predict.glm() function with
>>> type = "response". The default type of "link" will get you the linear
>>> predictor values as above. predict.glm() would typically be used to
>>> generate predictions using the model on new data.
>>>
>>>> head(predict(model1, type = "response"))
>>> 1 2 3 4 5 6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> In glm.fit(), which is the workhorse function in glm(), the fitted
>>> values returned in the model object are actually computed by using
>>> the inverse link function for the family passed to glm():
>>>
>>>> binomial()$linkinv
>>> function (eta)
>>> .Call("logit_linkinv", eta, PACKAGE = "stats")
>>> <environment: 0x171c8b10>
>>>
>>> Thus:
>>>> head(binomial()$linkinv(model1$linear.predictors))
>>> 1 2 3 4 5 6
>>> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
>>>
>>> So those are the same values as we saw above using the other methods;
>>> all is consistent across the various approaches.
>>>
>>> Perhaps the above gives you some insight into how R does these
>>> things, and also shows that, as is frequently the case with R, there
>>> is more than one way to get the same result.
>>>
>>> HTH,
>>> Marc Schwartz
>>
>> An interesting and informative reply, Marc; but I think it misses
>> the point of Fabrizio's query. I think Fabrizio's point is the
>> following:
>>
>> set.seed(54321)
>> X <- sort(rnorm(100))
>> a0 <- 1.0 ; b0 <- 0.5
>> Y <- 1*(runif(100) < a0 + b0*X)
>> GLM <- glm(Y~X,family=binomial)
>> C <- GLM$coef
>> C
>> # (Intercept) X
>> # 2.726966 2.798429
>> a1 <- C[1] ; b1 <- C[2]
>> Fit0 <- GLM$fit
>> S <- a1 + b1*X
>> Fit1 <- exp(S)/(1+exp(S))
>> max(abs(Fit1 - Fit0))
>> # [1] 1.110223e-16
>>
>> This discrepancy is of course, in magnitude, a typical "rounding
>> error". But the fact that it occurred indicates that when glm()
>> computed the fitted values it did not do so by using the fitted
>> coefficients GLM$coef, then creating the fitted score (linear
>> predictor) S (as above), then applying to this the "inverse link"
>> exp(S)/(1+exp(S)), since doing that would replicate the above
>> calculation and should yield identical results.
>>
>> In fact, a bit more probing to GLM shows that there is already
>> a discrepancy at the "score" level:
>>
>> S0 <- GLM$linear.predictors
>> max(abs(S0-S))
>> # [1] 8.881784e-16
>>
>> so S0 has not been calculated by applying GLM$coef to X. Also, if
>> we apply the inverse link to S0, then:
>>
>> max(abs(Fit0 - exp(S0)/(1+exp(S0))))
>> # [1] 1.110223e-16
>>
>> which is the same discrepancy as between Fit1 and Fit0.
>>
>> max(abs(Fit1 - exp(S0)/(1+exp(S0))))
>> # [1] 1.110223e-16
>>
>> the same again!!
>>
>> Hence, if I have understood him aright, Fabrizio's question.
>>
>> Hoping this helps,
>> Ted.
>
>
> Ted,
>
> What OS and version of R are you using?
>
> I am on OS X 10.5.7 and using 32-bit:
>
> R version 2.9.1 Patched (2009-06-30 r48877)
>
> The reason that I ask is that using your data and model, I get:
>
> > max(abs(Fit1 - Fit0))
> [1] 0
>
> > max(abs(S0-S))
> [1] 0
>
> > max(abs(Fit0 - exp(S0)/(1+exp(S0))))
> [1] 0
>
>
> Truth be told, I was initially leaning in the direction of a rounding
> error, but first wanted to be sure that there was not some underlying
> methodological error that Fabrizio was encountering. Hence the focus
> of my reply was to present a level of consistency in getting the
> results using multiple methods.
>
> When I saw your post, I started thinking that my example, which did
> not have a numeric (e.g. non-integer) covariate, might have been too
> simple and missed introducing an additional source of rounding error,
> but then I could not replicate your findings either...
>
> Thanks,
> Marc
I'm using Debian Linux:
Linux deb 2.6.18-6-686 #1 SMP Tue May 5 00:40:20 UTC 2009 i686
GNU/Linux
(32-bit as far as I know) and R version:
version.string R version 2.9.0 (2009-04-17)
After posting, I began to suspect that the compiled code which does
the actual computations may be working to higher precision internally,
but rounding to the precision given by R's .Machine:
$double.eps
[1] 2.220446e-16
$double.neg.eps
[1] 1.110223e-16
before attaching the results to the return-list for glm() (compare
the numerical discrepancies which I found above).
However, that is only surmise!
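One crude check of the surmise (a sketch, re-using S, S0, Fit0 and Fit1
from my example above) is to express the discrepancies as multiples of
the machine epsilon:

.Machine$double.eps                           # 2.220446e-16 here
max(abs(S0 - S)) / .Machine$double.eps        # 4   (i.e. 8.881784e-16)
max(abs(Fit1 - Fit0)) / .Machine$double.eps   # 0.5 (i.e. 1.110223e-16)

Small multiples of eps like these are consistent with ordinary last-bit
rounding, though they do not prove where it happens.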
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Jun-09 Time: 18:54:19
------------------------------ XFMail ------------------------------