[R] Remark on tapply().

Karl Ove Hufthammer karl at huftis.org
Tue Dec 1 08:32:02 CET 2009

On Tue, 1 Dec 2009 14:10:17 +1300 Rolf Turner <r.turner at auckland.ac.nz> wrote:
> Consider the following:
>  > set.seed(42)
>  > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
>  > x <- runif(42)
>  > tapply(x,ff,sum)
>         1        2        3        4        5
> 3.675436       NA 7.519675       NA 9.094210
> I got bitten by those NAs in the result of tapply().  Effectively
> one is summing over the empty set, and consequently (according to what
> I learned as a child) I thought that the result would be 0.

Note that this *is* documented on the help page for 'tapply', actually, 
in its description:

  Apply a function to each cell of a ragged array, that is to each
  (non-empty) group of values given by a unique combination of the 
  levels of certain factors. 

At first glance one might expect 'tapply' to be equivalent to

  sapply(split(x, ff), sum)

but that actually *does* give you 0 for levels 2 and 4. The reason for 
the difference is that (ignoring some details) 'tapply' does something 
closer to:

  sapply(split(x, as.numeric(ff)), sum)

which only looks at the actual values of 'ff', not its levels.
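
To make the difference concrete, here is a small sketch. The data are 
made up for illustration (they are not the values from the original 
post), chosen so that levels 2 and 4 are empty and the sums are easy to 
check by hand:

```r
# Hypothetical toy data: levels 1..5 declared, only 1, 3 and 5 observed
ff <- factor(c(1, 3, 5, 3, 5, 5), levels = 1:5)
x  <- c(1, 2, 3, 4, 5, 6)

# Splitting on the factor keeps every level, so empty groups are summed too
by_factor <- sapply(split(x, ff), sum)
print(by_factor)    # names 1..5; levels 2 and 4 give 0

# Splitting on the numeric codes only sees values that actually occur
by_value <- sapply(split(x, as.numeric(ff)), sum)
print(by_value)     # names 1, 3, 5 only

# tapply() reports the unobserved levels as NA
via_tapply <- tapply(x, ff, sum)
print(via_tapply)
```

So if zeros are wanted for the empty levels, the 'sapply(split(x, ff), 
sum)' form is one way to get them.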

Note that the value zero is not a special case. For instance,

  sapply(split(x, ff), prod)

gives the 'empty product', i.e., 1.
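
The same convention can be checked directly on an empty vector, without 
any factor involved. A minimal sketch (the variable name 'empty' is 
just for illustration):

```r
empty <- numeric(0)

empty_sum  <- sum(empty)    # additive identity: 0
empty_prod <- prod(empty)   # multiplicative identity: 1

# The logical reductions follow the same pattern
empty_all <- all(logical(0))   # TRUE
empty_any <- any(logical(0))   # FALSE

print(c(sum = empty_sum, prod = empty_prod))
print(c(all = empty_all, any = empty_any))
```

In each case the result is the identity element of the operation, which 
is what makes combining an empty group with a non-empty one harmless.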

Exercise for the reader:

Note that
  sapply(split(x, ff, drop=TRUE), sum)
gives you the values of (just) the non-empty levels.

Now, why does
  sapply(split(x, ff), sum, drop=TRUE)
give the wrong value (1) for the empty levels, while
  sapply(split(x, ff), sum, drop=FALSE)
gives the correct value?

(The answer should be fairly obvious, but it's an easy mistake to make.)
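
For anyone who wants to experiment (this gives the answer away), the 
sketch below reproduces the three calls on a small made-up data set; 
the data and variable names are illustrative, not from the original 
post. The point to watch is which function actually receives 'drop':

```r
# Hypothetical toy data: levels 2 and 4 are empty
ff <- factor(c(1, 3, 5, 3, 5, 5), levels = 1:5)
x  <- c(1, 2, 3, 4, 5, 6)

# Here 'drop' is an argument of split(): empty levels are removed
dropped <- sapply(split(x, ff, drop = TRUE), sum)
print(dropped)      # only levels 1, 3, 5

# Here 'drop' never reaches split(); sapply() forwards it to sum()
# through '...', so each call is effectively sum(group, drop = TRUE)
with_true  <- sapply(split(x, ff), sum, drop = TRUE)
with_false <- sapply(split(x, ff), sum, drop = FALSE)
print(with_true)    # every sum is too large by 1
print(with_false)   # sums unchanged

# sum() has no 'drop' argument, so the logical is just one more summand
sum(numeric(0), drop = TRUE)    # TRUE is coerced to 1
sum(numeric(0), drop = FALSE)   # FALSE is coerced to 0
```

Note that with 'drop=TRUE' the non-empty levels are off by one as well, 
which is easy to overlook when only the empty levels are inspected.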

Karl Ove Hufthammer
