[R] fitting in logistic model

Tue Jun 30 17:44:58 CEST 2009

On 30-Jun-09 14:52:20, Marc Schwartz wrote:
> On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
> 
>> I would like to know how R computes the probability of an event
>> in a logistic model (P(y=1)) from the score s, linear combination
>> of x and beta.
>>
>> I noticed that there are differences (small, less than e-16) between  
>> the fitting values automatically computed in the glm procedure by R,
>> and the values "manually" computed by me applying the reverse
>> formula p=e^s/(1+e^s); moreover I noticed  that the minimum value
>> of the fitting values in my estimation is 2.220446e-16, and there
>> are many observation with this probability (instead the minimum
>> value obtained by "manually" estimation is 2.872636e-152).
> 
> It would be helpful to see at least a subset of the output from your  
> model and your manual computations so that we can at least visually  
> compare the results to see where the differences may be.
> 
> The model object returned from using glm() will contain both the  
> linear predictors on the link scale (model$linear.predictors) and
> the fitted values (model$fitted.values). The latter can be accessed
> using the fitted() extractor function.
> 
> To use an example, let's create a simple LR model using the infert  
> data set as referenced in ?glm.
> 
> model1 <- glm(case ~ spontaneous + induced, data = infert,
>               family = binomial())
> 
> > model1
> Call:  glm(formula = case ~ spontaneous + induced,
             family = binomial(), data = infert)
> 
> Coefficients:
> (Intercept)  spontaneous      induced
>      -1.7079       1.1972       0.4181
> 
> Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
> Null Deviance:            316.2
> Residual Deviance: 279.6      AIC: 285.6
> 
># Get the coefficients
>  > coef(model1)
> (Intercept) spontaneous     induced
>   -1.7078601   1.1972050   0.4181294
> 
># get the linear predictor values
># log odds scale for binomial glm
>  > head(model1$linear.predictors)
>            1           2           3           4           5          
> 6
>   1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564 
> 0.32560375
> 
> You can also get the above by using the coefficients and the model  
> matrix for comparison:
># the full set of 248
># coef(model1) %*% t(model.matrix(model1))
> > head(as.vector(coef(model1) %*% t(model.matrix(model1))))
> [1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564   
> 0.32560375
> 
># get fitted response values (predicted probs)
>  > head(fitted(model1))
>          1         2         3         4         5         6
> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
> 
> We can also get the fitted values from the linear predictor values
> by using:
> 
> LP <- model1$linear.predictors
> > head(exp(LP) / (1 + exp(LP)))
>         1         2         3         4         5         6
> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
> 
> You can also get the above by using the predict.glm() function with  
> type == "response". The default type of "link" will get you the linear 
> predictor values as above. predict.glm() would typically be used to  
> generate predictions using the model on new data.
> 
>  > head(predict(model1, type = "response"))
>          1         2         3         4         5         6
> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
> 
> In glm.fit(), which is the workhorse function in glm(), the fitted  
> values returned in the model object are actually computed by using
> the inverse link function for the family passed to glm():
> 
>  > binomial()$linkinv
> function (eta)
> .Call("logit_linkinv", eta, PACKAGE = "stats")
> <environment: 0x171c8b10>
> 
> Thus:
>  > head(binomial()$linkinv(model1$linear.predictors))
>          1         2         3         4         5         6
> 0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
> 
> So those are the same values as we saw above using the other methods.  
> So, all is consistent across the various methods.
> 
> Perhaps the above provides some insights for you into how R does some  
> of these things and also to point out, as is frequently the case with  
> R, there is more than one way to get the same result.
> 
> HTH,
> Marc Schwartz

An interesting and informative reply, Marc; but I think it misses
the point of Fabrizio's query. I think Fabrizio's point is the
following:

  set.seed(54321)
  X <- sort(rnorm(100))
  a0 <- 1.0 ; b0 <- 0.5
  Y <- 1*(runif(100) < a0 + b0*X)
  GLM <- glm(Y~X,family=binomial)
  C <- GLM$coef
  C
  # (Intercept)           X 
  #    2.726966    2.798429 
  a1 <- C[1] ; b1 <- C[2]
  Fit0 <- GLM$fit
  S <- a1 + b1*X
  Fit1 <- exp(S)/(1+exp(S))
  max(abs(Fit1 - Fit0))
  # [1] 1.110223e-16

This discrepancy is of course, in magnitude, a typical "rounding
error". But the fact that it occurred indicates that when glm()
computed the fitted values it did not do so by using the fitted
coefficients GLM$coef, then creating the fitted score (linear
predictor) S (as above), then applyhing to this the "inverse link"
exp(S)/(1+exp(S)), since doing that would replicate the above
calculation and should yield identical results.

In fact, a bit more probing to GLM shows that there is already
a discrepancy at the "score" level:

  S0 <- GLM$linear.predictors
  max(abs(S0-S))
  # [1] 8.881784e-16

so S0 has not been calculated by applying GLM$coef to X. Also, if
we apply the inverse link to S0, then:

  max(abs(Fit0 - exp(S0)/(1+exp(S0))))
  # [1] 1.110223e-16

which is the same discrepancy as between Fit1 and Fit0.

  max(abs(Fit1 - exp(S0)/(1+exp(S0))))
  # [1] 1.110223e-16

the same again!! 

Hence, if I have understood him aright, Fabrizio's question.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Jun-09                                       Time: 16:44:56
------------------------------ XFMail ------------------------------