[R] lm() silently drops NAs

Tue Jul 26 22:26:11 CEST 2016

On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
> I have been asked (in private)

Martin was very polite to not share my name, but it was me :)

>     > Hi Martin,
>
>     y <- c(1, 2, 3, NA, 4)
>     x <- c(1, 2, 2, 1, 1)
>
>     t.test(y ~ x)
>     lm(y ~ x)
>
>     > Normally, most R functions follow the principle that
>     > "missings should never silently go missing". Do you have
>     > any background on why these functions (and most model/test
>     > function in stats, I presume) silently drop missing values?
>
> And I think, the issue and an answer are important enough to be
> public, hence this posting to R-help :
>
> First note that in some sense it is not true that lm(y ~ x) silently drops
> NAs: Everybody who is taught about lm() is taught to look at
>    summary( fm )  where  fm <- lm(y ~ x)

Good point - unfortunately the message was a bit subtle for me, and
don't remember being taught to look for it :(

> and that (for the above case)  "says"
>
>  ' (1 observation deleted due to missingness) '
>
> and so is not entirely silent.
>
>
> This goes all back to the idea of having an 'na.action' argument
> which may be a function (a very good idea, "functional
> programming" in a double sense!)... which Chambers et al
> introduced in The White Book (*1) and which I think to remember
> was quite a revolutionary idea; at least I had liked that very much
> once I understood the beauty of passing functions as arguments
> to other functions.
> One problem already back then has been that we already had the
> ---much more primitive but often sufficient--- standard of an
> 'na.rm = FALSE' (i.e. default FALSE) argument.
>
> This has been tought in all good classes/course about statistical
> modeling with S and R ever since ... I had hoped ....
> (but it seems I was too optimistic, .. or many students have too
>  quickly forgotten what they were taught ..)
> Notably the white book itself, and the MASS (*2) book do teach
> this.. though possibly not loudly enough.
>
> Two more decisions about this were made back then, as well:
>
>   1) The default for na.action to be na.omit  (==> "silently dropping")
>
>   2) na.action being governed by  options(na.action = ..)
>
> '1)' may have been mostly "historical": I think it had been the behavior of
> other "main stream" statistical packages back then (and now?) *and*
> possibly more importantly, notably with the later (than white book = "S3")
> advent of na.replace, you did want to keep the missing in your
> data frame, for later analysis; e.g. drawing (-> "gaps" in
> plots) so the NA *were* carried along and would not be
> forgotten, something very important in most case.s.
>
> and '2)' is something I personally no longer like very
> much, as it is "killing" the functional paradigm.
> OTOH, it has to be said in favor of that "session wide" / "global" setting
>       options(na.action = *)
> that indeed it depends on the specific data analysis, or even
> the specific *phase* of a data analysis, *what* behavior of NA
> treatment is desired.... and at the time it was thought smart
> that all methods (also functions "deep down" called by
> user-called functtions) would automatically use the "currently
> desired" NA handling.
>
> There have been recommendations (I don't know exactly where and
> by whom) to always set
>
>    options(na.action = na.fail)
>
> in your global .First() or nowadays rather your  Rprofile, and
> I assume that some of the CRAN packages and some of the "teaching
> setups" would do that (and if you do that, the lm() and t.test()
> calls above give an error).

I think that's a bit too strict for me, so I wrote my own:

na.warn <- function(object, ...) {
  missing <- complete.cases(object)
  if (any(missing)) {
    warning("Dropping ", sum(missing), " rows with missing values",
call. = FALSE)
  }

  na.exclude(object, ...)
}

That gives me the behaviour I want:

options(na.action = na.warn)

y <- c(1, 2, 3, NA, 4)
x <- c(1, 2, 2, 1, 1)
mod <- lm(y ~ x)
#> Warning: Dropping 4 rows with missing values

predict(mod)
#>   1   2   3   4   5
#> 2.5 2.5 2.5  NA 2.5
resid(mod)
#>    1    2    3    4    5
#> -1.5 -0.5  0.5   NA  1.5

To me, this would be the most sensible default behaviour, but I
realise it's too late to change without breaking many existing
expectations.

On a related note, I've never really understood why it's called
na.exclude - from my perspective it causes the _inclusion_ of missing
values in the predictions/residuals.

Thanks for the (as always!) informative response, Martin.

Hadley

-- 
http://hadley.nz