Karl Ove Hufthammer
karl at huftis.org
Tue Dec 1 08:32:02 CET 2009
On Tue, 1 Dec 2009 14:10:17 +1300 Rolf Turner <r.turner at auckland.ac.nz>
wrote:
> Consider the following:
>
> > set.seed(42)
> > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
> > x <- runif(42)
> > tapply(x,ff,sum)
>        1        2        3        4        5
> 3.675436       NA 7.519675       NA 9.094210
>
> I got bitten by those NAs in the result of tapply(). Effectively
> one is summing over the empty set, and consequently (according to what
> I learned as a child) I thought that the result would be 0.
Note that this *is* actually documented on the help page for 'tapply',
in its description:
    Apply a function to each cell of a ragged array, that is to each
    (non-empty) group of values given by a unique combination of the
    levels of certain factors.
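One might expect that (ignoring some details) 'tapply' does:
    sapply(split(x, ff), sum)
But that actually *does* give you 0 for levels 2 and 4. The reason for
the NAs is that (again ignoring some details) 'tapply' really does
something closer to:
    sapply(split(x, as.numeric(ff)), sum)
which only looks at the actual values of 'ff', not its levels. The
function is never called for the empty levels, and their cells in the
result are left as NA.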
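
The difference between splitting on the factor and splitting on its
values can be seen on a small example. This is only a sketch: the data
below are made up for illustration (they are not the values from the
original post), and the variable names 'by_levels', 'by_codes' and
'by_tapply' are hypothetical.

```r
# Made-up data: levels 2 and 4 of 'ff' never occur.
ff <- factor(c(1, 3, 5, 1, 3, 5), levels = 1:5)
x  <- c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5)

# Splitting on the factor keeps one (possibly empty) group per level,
# so sum() is also called on the empty groups.
by_levels <- sapply(split(x, ff), sum)

# Splitting on the numeric codes only sees the values that occur,
# so there are no groups for 2 and 4 at all.
by_codes <- sapply(split(x, as.numeric(ff)), sum)

# tapply() has a cell for every level but leaves the empty ones as NA.
by_tapply <- tapply(x, ff, sum)

print(by_levels)
print(by_codes)
print(by_tapply)
```

The first result has five elements with 0 for the empty levels, the
second has only three elements, and the third has five elements with NA
for the empty levels.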
Note that zero is not a special case here; each function simply returns
whatever it returns for an empty argument. For instance,
    sapply(split(x, ff), prod)
gives the 'empty product', i.e., 1, for levels 2 and 4.
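
As a quick check of what some base R functions return for an empty
argument (a small sketch; the variable names are hypothetical):

```r
empty <- numeric(0)

# The empty sum is the additive identity, the empty product the
# multiplicative identity.
empty_sum  <- sum(empty)    # 0
empty_prod <- prod(empty)   # 1

# The logical analogues behave the same way.
empty_all <- all(logical(0))  # TRUE
empty_any <- any(logical(0))  # FALSE

print(c(sum = empty_sum, prod = empty_prod))
print(c(all = empty_all, any = empty_any))
```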
Exercise for the reader:
Note that
    sapply(split(x, ff, drop=TRUE), sum)
gives you the values of (just) the non-empty levels.
Now, why does
    sapply(split(x, ff), sum, drop=TRUE)
give the wrong value (1) for the empty levels, while
    sapply(split(x, ff), sum, drop=FALSE)
gives the correct value (0)?
(The answer should be fairly obvious, but it's an easy mistake to make.)
--
Karl Ove Hufthammer
