[R] lm() silently drops NAs
Hadley Wickham
h.wickham at gmail.com
Tue Jul 26 22:26:11 CEST 2016
On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
> I have been asked (in private)
Martin was very polite to not share my name, but it was me :)
> > Hi Martin,
>
> y <- c(1, 2, 3, NA, 4)
> x <- c(1, 2, 2, 1, 1)
>
> t.test(y ~ x)
> lm(y ~ x)
>
> > Normally, most R functions follow the principle that
> > "missings should never silently go missing". Do you have
> > any background on why these functions (and most model/test
> > function in stats, I presume) silently drop missing values?
>
> And I think, the issue and an answer are important enough to be
> public, hence this posting to R-help :
>
> First note that in some sense it is not true that lm(y ~ x) silently drops
> NAs: Everybody who is taught about lm() is taught to look at
> summary( fm ) where fm <- lm(y ~ x)
Good point - unfortunately the message was a bit subtle for me, and
don't remember being taught to look for it :(
> and that (for the above case) "says"
>
> ' (1 observation deleted due to missingness) '
>
> and so is not entirely silent.
>
>
> This goes all back to the idea of having an 'na.action' argument
> which may be a function (a very good idea, "functional
> programming" in a double sense!)... which Chambers et al
> introduced in The White Book (*1) and which I think to remember
> was quite a revolutionary idea; at least I had liked that very much
> once I understood the beauty of passing functions as arguments
> to other functions.
> One problem already back then has been that we already had the
> ---much more primitive but often sufficient--- standard of an
> 'na.rm = FALSE' (i.e. default FALSE) argument.
>
> This has been tought in all good classes/course about statistical
> modeling with S and R ever since ... I had hoped ....
> (but it seems I was too optimistic, .. or many students have too
> quickly forgotten what they were taught ..)
> Notably the white book itself, and the MASS (*2) book do teach
> this.. though possibly not loudly enough.
>
> Two more decisions about this were made back then, as well:
>
> 1) The default for na.action to be na.omit (==> "silently dropping")
>
> 2) na.action being governed by options(na.action = ..)
>
> '1)' may have been mostly "historical": I think it had been the behavior of
> other "main stream" statistical packages back then (and now?) *and*
> possibly more importantly, notably with the later (than white book = "S3")
> advent of na.replace, you did want to keep the missing in your
> data frame, for later analysis; e.g. drawing (-> "gaps" in
> plots) so the NA *were* carried along and would not be
> forgotten, something very important in most case.s.
>
> and '2)' is something I personally no longer like very
> much, as it is "killing" the functional paradigm.
> OTOH, it has to be said in favor of that "session wide" / "global" setting
> options(na.action = *)
> that indeed it depends on the specific data analysis, or even
> the specific *phase* of a data analysis, *what* behavior of NA
> treatment is desired.... and at the time it was thought smart
> that all methods (also functions "deep down" called by
> user-called functtions) would automatically use the "currently
> desired" NA handling.
>
> There have been recommendations (I don't know exactly where and
> by whom) to always set
>
> options(na.action = na.fail)
>
> in your global .First() or nowadays rather your Rprofile, and
> I assume that some of the CRAN packages and some of the "teaching
> setups" would do that (and if you do that, the lm() and t.test()
> calls above give an error).
I think that's a bit too strict for me, so I wrote my own:
na.warn <- function(object, ...) {
missing <- complete.cases(object)
if (any(missing)) {
warning("Dropping ", sum(missing), " rows with missing values",
call. = FALSE)
}
na.exclude(object, ...)
}
That gives me the behaviour I want:
options(na.action = na.warn)
y <- c(1, 2, 3, NA, 4)
x <- c(1, 2, 2, 1, 1)
mod <- lm(y ~ x)
#> Warning: Dropping 4 rows with missing values
predict(mod)
#> 1 2 3 4 5
#> 2.5 2.5 2.5 NA 2.5
resid(mod)
#> 1 2 3 4 5
#> -1.5 -0.5 0.5 NA 1.5
To me, this would be the most sensible default behaviour, but I
realise it's too late to change without breaking many existing
expectations.
On a related note, I've never really understood why it's called
na.exclude - from my perspective it causes the _inclusion_ of missing
values in the predictions/residuals.
Thanks for the (as always!) informative response, Martin.
Hadley
--
http://hadley.nz
More information about the R-help
mailing list