[R] aggregate(), tapply(): Why is the order of the grouping variables not kept?

Peter Ehlers ehlers at ucalgary.ca
Tue Mar 12 00:59:02 CET 2013


On 2013-03-11 13:52, Marius Hofert wrote:
> Dear expeRts,
>
> The question is rather simple: Why does aggregate() (or, similarly, tapply()) not keep the order of the grouping variable(s)?
>
> Here is an example:
>
> x <- data.frame(group = rep(LETTERS[1:2], each=10),
>                  year  = rep(rep(2001:2005, each=2), 2),
>                  value = rep(1:10, each=2))
> ## => sorted according to group, then year
> aggregate(value ~ group + year, data=x, FUN=function(z) z[1])
> ## => sorted according to year, then group
>
> I rather expected this to be the default:
>
> aggregate(value ~ year + group, data=x, FUN=function(z) z[1])[,c(2,1,3)]
> ## => same order as input (grouping) variables
>
> Same with tapply:
>
> as.data.frame(as.table(tapply(x$value, list(x$group, x$year), FUN=function(z) z[1])))
>
>
> Cheers,
>
> Marius

I'm no expeRt, but suppose that we change the setup slightly:

   xx <- x[sample(nrow(x)), ]

Now what would you like

  aggregate(value ~ group + year, data=xx, FUN=function(z) z[1])

to return?

Personally, I prefer that R return the same thing regardless
of how the input data frame is sorted, i.e. the result should
depend only on the formula. You just have to know that the rows
are ordered so that the first factor varies most rapidly, then the
next, etc. I think that's documented somewhere, but I don't know where.
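
A minimal sketch of the point, using the x from the original post
(the names a1, a2, a3 and the set.seed() call are only for
illustration). It aggregates the sorted and the shuffled data frame
with the same formula, compares the two results, and then reorders
one of them by group and year with order(), which is one way to get
the layout asked for:

```r
## Data from the original post
x <- data.frame(group = rep(LETTERS[1:2], each = 10),
                year  = rep(rep(2001:2005, each = 2), 2),
                value = rep(1:10, each = 2))

## Shuffle the rows (seed fixed only to make the sketch reproducible)
set.seed(1)
xx <- x[sample(nrow(x)), ]

## Same formula, differently sorted input
a1 <- aggregate(value ~ group + year, data = x,  FUN = function(z) z[1])
a2 <- aggregate(value ~ group + year, data = xx, FUN = function(z) z[1])

## Compare the two results, ignoring attributes
all.equal(a1, a2, check.attributes = FALSE)

## 'group', the first factor in the formula, varies fastest
head(a1)

## Reorder afterwards if rows sorted by group, then year, are wanted
a3 <- a1[order(a1$group, a1$year), ]
a3
```

Note that value is constant within each group/year cell here, so
z[1] does not depend on the row order within a cell; with other data
or another FUN that need not hold.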

Peter Ehlers


