[R] A Tip: lm, glm, and retained cases

Wed Aug 27 11:51:04 CEST 2008

Marc Schwartz wrote:
> on 08/26/2008 07:31 PM (Ted Harding) wrote:
>> On 26-Aug-08 23:49:37, hadley wickham wrote:
>>> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
>>> <Ted.Harding at manchester.ac.uk> wrote:
>>>> Hi Folks,
>>>> This tip is probably lurking somewhere already, but I've just
>>>> discovered it the hard way, so it is probably worth passing
>>>> on for the benefit of those who might otherwise hack their
>>>> way along the same path.
>>>>
>>>> Say (for example) you want to do a logistic regression of a
>>>> binary response Y on variables X1, X2, X3, X4:
>>>>
>>>>  GLM <- glm(Y ~ X1 + X2 + X3 + X4, family = binomial)
>>>>
>>>> Say there are 1000 cases in the data. Because of missing values
>>>> (NAs) in the variables, the number of complete cases retained
>>>> for the regression is, say, 600. glm() does this automatically.
>>>>
>>>> QUESTION: Which cases are they?
>>>>
>>>> You can of course find out "by hand" on the lines of
>>>>
>>>>  ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>>>
>>>> but one feels that GLM already knows -- so how to get it to talk?
>>>>
>>>> ANSWER: (e.g.)
>>>>
>>>>  ix <- as.integer(names(GLM$fitted.values))
>>>>         
>>> Alternatively, you can use:
>>>
>>> attr(GLM$model, "na.action")
>>>
>>> Hadley
>>>       
>> Thanks! I can see that it works -- though understanding how
>> requires a deeper knowledge of "R internals". However, since
>> you've approached it from that direction, simply
>>
>>   GLM$model
>>
>> is a data frame of the retained cases (with corresponding
>> row-names), all variables at once, and that is possibly an
>> even simpler approach!
>>     
>
> Or just use:
>
>    model.frame(ModelObject)
>
> as the extractor function...  :-)
>
> Another 'a priori' approach would be to use na.omit() or one of its
> brethren on the data frame before creating the model. Which function is
> used depends upon how 'na.action' is set.
>
> The returned value, or more specifically its 'na.action' attribute,
> tells you which records were excluded, much as Hadley's approach
> does.
>
> For example, using the simple data frame in ?na.omit:
>
> DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
>
> > DF
>   x  y
> 1 1  0
> 2 2 10
> 3 3 NA
>
> DF.na <- na.omit(DF)
>
> > DF.na
>   x  y
> 1 1  0
> 2 2 10
>
> > attr(DF.na, "na.action")
> 3
> 3
> attr(,"class")
> [1] "omit"
>
>
> So you can see that record 3 was removed from the original data frame
> due to the NA for 'y'.
>   
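For what it is worth, the routes mentioned in the quoted messages can be
checked against each other on a toy example. This is only a sketch: the
data frame 'd' and its values are made up for illustration, and reading
case numbers off the names assumes the default row names 1..n.

```r
# Hypothetical toy data: 8 cases, one NA in the response, one in the predictor.
d <- data.frame(Y  = c(0, 1, NA, 1, 0, 1, 0, 1),
                X1 = c(1.2, NA, 0.7, 2.1, 0.3, 0.5, 1.9, 1.8))

# family = binomial makes this a logistic regression.
fit <- glm(Y ~ X1, family = binomial, data = d)

# Retained cases, three ways (valid only with default integer row names):
kept1 <- as.integer(names(fitted(fit)))          # names of the fitted values
kept2 <- as.integer(rownames(model.frame(fit)))  # row names of the model frame
kept3 <- which(complete.cases(d[, c("Y", "X1")]))  # "by hand" on the data

# Dropped cases, as recorded by the fit itself:
dropped <- as.integer(na.action(fit))

print(kept1)
print(dropped)
```

If the data frame carries non-numeric row names, as.integer() would give
NAs, so in that case it is probably safer to work with the row names
themselves.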
Also notice the possibility of

(g)lm(....., na.action=na.exclude)

as in

library(ISwR)
fit <- lm(short.velocity ~ blood.glucose, data = thuesen, na.action = na.exclude)
which(is.na(fitted(fit))) # 16

This is often advisable anyway, e.g. if you want to plot residuals
against the original predictors: with na.exclude, residuals() and
fitted() are padded with NA to the length of the original data.
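
A minimal sketch of the difference, for anyone without ISwR installed:
here the built-in 'airquality' data set stands in for thuesen (my
substitution, not part of the example above), and na.omit is passed
explicitly so that the comparison does not depend on options("na.action").

```r
# 'airquality' ships with R (datasets package); Ozone and Solar.R contain NAs.
vars <- c("Ozone", "Solar.R")

fit_omit <- lm(Ozone ~ Solar.R, data = airquality, na.action = na.omit)
fit_excl <- lm(Ozone ~ Solar.R, data = airquality, na.action = na.exclude)

# na.omit: residuals only for the complete cases.
n_omit <- length(residuals(fit_omit))

# na.exclude: residuals padded with NA, one per row of the original data.
n_excl <- length(residuals(fit_excl))

# Rows that were dropped show up as NA in the padded fitted values.
dropped <- unname(which(is.na(fitted(fit_excl))))

print(c(n_omit = n_omit, n_excl = n_excl, n_rows = nrow(airquality)))

# With the padded version the lengths match, so this works directly:
# plot(airquality$Solar.R, residuals(fit_excl))
```

The coefficients of the two fits are the same; only the extractor
functions behave differently.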

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907