[R] A Tip: lm, glm, and retained cases
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Wed Aug 27 11:51:04 CEST 2008
Marc Schwartz wrote:
> on 08/26/2008 07:31 PM (Ted Harding) wrote:
>
>> On 26-Aug-08 23:49:37, hadley wickham wrote:
>>
>>> On Tue, Aug 26, 2008 at 6:45 PM, Ted Harding
>>> <Ted.Harding at manchester.ac.uk> wrote:
>>>
>>>> Hi Folks,
>>>> This tip is probably lurking somewhere already, but I've just
>>>> discovered it the hard way, so it is probably worth passing
>>>> on for the benefit of those who might otherwise hack their
>>>> way along the same path.
>>>>
>>>> Say (for example) you want to do a logistic regression of a
>>>> binary response Y on variables X1, X2, X3, X4:
>>>>
>>>> GLM <- glm(Y ~ X1 + X2 + X3 + X4, family = binomial)
>>>>
>>>> Say there are 1000 cases in the data. Because of missing values
>>>> (NAs) in the variables, the number of complete cases retained
>>>> for the regression is, say, 600. glm() does this automatically.
>>>>
>>>> QUESTION: Which cases are they?
>>>>
>>>> You can of course find out "by hand" on the lines of
>>>>
>>>> ix <- which( (!is.na(Y))&(!is.na(X1))&...&(!is.na(X4)) )
>>>>
>>>> but one feels that GLM already knows -- so how to get it to talk?
>>>>
>>>> ANSWER: (e.g.)
>>>>
>>>> ix <- as.integer(names(fitted(GLM)))
>>>>
>>> Alternatively, you can use:
>>>
>>> attr(GLM$model, "na.action")
>>>
>>> Hadley
>>>
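A minimal sketch of the two suggestions above, on simulated data. The data frame `dat`, its size, and the positions of the NAs are invented for illustration and are not from the thread. Note that as.integer(names(...)) only recovers row positions when the data frame has default (1, 2, 3, ...) row names.

```r
# Simulated data; variable names mirror the example above (assumption).
set.seed(1)
n <- 20
dat <- data.frame(Y = rbinom(n, 1, 0.5), X1 = rnorm(n), X2 = rnorm(n))
dat$X1[c(3, 7)] <- NA   # introduce missing values in the predictors
dat$X2[12] <- NA
dat$Y[18] <- NA         # and one in the response

GLM <- glm(Y ~ X1 + X2, family = binomial, data = dat)

# Rows retained in the fit: fitted values are named by row name.
kept <- as.integer(names(fitted(GLM)))

# Rows dropped by the default na.action (na.omit).
dropped <- as.integer(na.action(GLM))
```

Here `kept` and `dropped` together should account for every row of `dat`.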
>> Thanks! I can see that it works -- though understanding how
>> requires a deeper knowledge of "R internals". However, since
>> you've approached it from that direction, simply
>>
>> GLM$model
>>
>> is a dataframe of the retained cases (with corresponding
>> row-names), all variables at once, and that is possibly an
>> even simpler approach!
>>
>
> Or just use:
>
> model.frame(ModelObject)
>
> as the extractor function... :-)
>
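As a small illustration of the extractor approach, assuming a toy data frame `df` (hypothetical, not from the thread): the row names of the model frame identify the retained cases, and can be used to pull the matching rows back out of the original data.

```r
# Toy data: row 2 has a missing x, row 3 a missing y (assumption).
df <- data.frame(y = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
                 x = c(1, NA, 3, 4, 5, 6))

fit <- lm(y ~ x, data = df)

mf <- model.frame(fit)          # data frame of the cases actually used
kept_names <- rownames(mf)      # row names of the retained cases
retained <- df[kept_names, ]    # same cases, taken from the original data
```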
> Another 'a priori' approach would be to use na.omit() or one of its
> brethren on the data frame before creating the model. Which function
> the model-fitting code applies by default depends upon how
> options("na.action") is set.
>
> The 'na.action' attribute of the returned value then tells you which
> records were excluded, much as in Hadley's approach.
>
> For example, using the simple data frame in ?na.omit:
>
> DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
>
> > DF
>   x  y
> 1 1  0
> 2 2 10
> 3 3 NA
>
> DF.na <- na.omit(DF)
>
> > DF.na
>   x  y
> 1 1  0
> 2 2 10
>
> > attr(DF.na, "na.action")
> 3
> 3
> attr(,"class")
> [1] "omit"
>
>
> So you can see that record 3 was removed from the original data frame
> due to the NA for 'y'.
>
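One caveat with the 'a priori' approach: na.omit() on a whole data frame drops rows with an NA in any column, including columns the model never uses. A hedged sketch, assuming a hypothetical data frame `dat` with an unused column `z`, restricts na.omit() to the model variables first:

```r
# Hypothetical data: z contains an NA but is not part of the model.
dat <- data.frame(y = c(0, 10, NA, 7, 3),
                  x = c(1, 2, 3, NA, 5),
                  z = c(NA, 1, 1, 1, 1))

vars  <- c("y", "x")            # only the variables used in the model
clean <- na.omit(dat[, vars])   # drop incomplete cases up front

fit <- lm(y ~ x, data = clean)

# Positions of the rows that were removed before fitting.
dropped <- as.integer(attr(clean, "na.action"))
```

Calling na.omit(dat) instead would also discard row 1, because of the NA in `z`.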
Also notice the possibility of

  (g)lm(....., na.action=na.exclude)

as in

  library(ISwR); attach(thuesen)
  fit <- lm(short.velocity ~ blood.glucose, na.action=na.exclude)
  which(is.na(fitted(fit))) # 16

This is often recommendable anyway, e.g. in case you want to plot
residuals against original predictors.
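
For readers without the ISwR package, the same idea can be sketched with the built-in airquality data (a substitution made here for illustration, not the example used above). With na.exclude, residuals() and fitted() are padded with NA back to the length of the original data, so they line up with the original predictors.

```r
# airquality ships with R; Ozone and Solar.R both contain NAs.
fit <- lm(Ozone ~ Solar.R, data = airquality, na.action = na.exclude)

res <- residuals(fit)                   # one entry per original row
same_length <- length(res) == nrow(airquality)

dropped <- which(is.na(res))            # rows excluded from the fit

# Residuals can now be plotted directly against the original predictor:
# plot(airquality$Solar.R, res)
```

With the default na.omit, the residual vector would be shorter than airquality$Solar.R and the plot call would fail on mismatched lengths.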
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907