[R] using poly in a linear regression in the presence of NA f ails (despite subsetting them out)

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Feb 15 08:19:18 CET 2005


Andy,

I don't think it is a bug.  The problem is that poly(x, 2) depends on the 
possible set of x values, and so needs to know all of them, unlike 
e.g. log(x) which is observation-by-observation.  Silently omitting 
missing values is not a good idea in such cases, especially if the values 
are missing in other variables (which is what na.action is likely to do).

I would say models with poly, ns, bs etc are inadvisable in the presence 
of missing values in their argument.  We could make poly() give an 
informative message, though.

Brian


On Mon, 14 Feb 2005, Liaw, Andy wrote:

> This smells like a bug to me.  The error is triggered by the line:
>
>   variables <- eval(predvars, data, env)
>
> inside model.frame.default().  At that point, na.action has not been
> applied, so poly() ended being called on data that still contains missing
> values.  The qr() that issued the error is for generating the orthogonal
> basis when evaluating poly(), not for fitting the linear model itself.
>
> Essentially, calling
>
>  model.frame(y ~ poly(x, 2), data=data.frame(x=c(NA, 1:3), y=rnorm(4)),
>              na.action=na.omit)
>
> would show the same error.  The obvious workaround is to omit cases with NAs
> before calling lm().
>
> Andy
>
>> From: Markus Jäntti
>>
>> I ran into a to me surprising result on running lm with an orthogonal
>> polynomial among the predictors.
>>
>> The lm command resulted in
>>
>> Error in qr(X) : NA/NaN/Inf in foreign function call (arg 1)
>> Error during wrapup:
>>
>> despite my using a "subset" in the call to get rid of NA's.
>>
>> poly is apparently evaluated before any NA's are subsetted out
>> of the data.
>>
>> Example code (attached to this e-mail as as a script):
>> > ## generate some data
>> > n <- 50
>> > x <- runif(n)
>> > a0 <- 10
>> > a1 <- .5
>> > sigma.e <- 1
>> > y <- a0 + a1*x + rnorm(n)*sigma.e
>> > tmp.d <- data.frame(y, x)
>> > rm(list=c("n", "x", "a0", "a1", "sigma.e", "y"))
>> >
>> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d)
>> +
>> + ## now make a few NA's
>> +
>> + tmp.d$x[1:2] <- rep(NA, 2)
>> Error: syntax error
>> Error during wrapup:
>> >
>> > ## this fails, just as it should
>> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d))
>>
>> Call:
>> lm(formula = y ~ poly(x, 2), data = tmp.d)
>>
>> Coefficients:
>> (Intercept)  poly(x, 2)1  poly(x, 2)2
>>       10.380       -0.242       -1.441
>>
>> >
>> > ## these also fail, but should not?
>> >
>> > print(lm.2 <- lm(y ~ poly(x, 2), data = tmp.d, subset = !is.na(x)))
>>
>> Call:
>> lm(formula = y ~ poly(x, 2), data = tmp.d, subset = !is.na(x))
>>
>> Coefficients:
>> (Intercept)  poly(x, 2)1  poly(x, 2)2
>>       10.380       -0.242       -1.441
>>
>> > print(lm.3 <- lm(y ~ poly(x, 2), data = tmp.d, na.action =
>> na.omit))
>>
>> Call:
>> lm(formula = y ~ poly(x, 2), data = tmp.d, na.action = na.omit)
>>
>> Coefficients:
>> (Intercept)  poly(x, 2)1  poly(x, 2)2
>>       10.380       -0.242       -1.441
>>
>> >
>> > ## but this works
>> >
>> > print(lm.3 <- lm(y ~ poly(x, 2), data = subset(tmp.d, subset =
>> !is.na(x))))
>>
>> Call:
>> lm(formula = y ~ poly(x, 2), data = subset(tmp.d, subset = !is.na(x)))
>>
>> Coefficients:
>> (Intercept)  poly(x, 2)1  poly(x, 2)2
>>       10.380       -0.242       -1.441
>>
>> --------------------
>>
>> The documentation of lm is *not* misleading at this point, saying that
>>
>> subset 	an optional vector specifying a subset of
>> observations to be
>> used in the fitting process.
>>
>> which implies that data are subsetted once lm.fit is called.
>> All the same, this behavior is a little unexpected to me.
>> Is it to be considered a feature, that is, does it produce beneficial
>> side effects which explain why it works as it does?
>>
>> Regards,
>>
>> Markus
>>
>> I am running R on a Debian testing system with kernel 2.6.10 and
>>
>> > version
>>           _
>> platform i386-pc-linux-gnu
>> arch     i386
>> os       linux-gnu
>> system   i386, linux-gnu
>> status
>> major    2
>> minor    0.1
>> year     2004
>> month    11
>> day      15
>> language R
>> --
>> Markus Jantti
>> Abo Akademi University
>> markus.jantti at iki.fi
>> http://www.iki.fi/~mjantti
>>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


More information about the R-help mailing list