[R] the first and last observation for each subject

Mon Jan 5 18:02:45 CET 2009

> -----Original Message-----
> From: hadley wickham [mailto:h.wickham at gmail.com] 
> Sent: Sunday, January 04, 2009 8:56 PM
> To: William Dunlap
> Cc: gallon.li at gmail.com; R help
> Subject: Re: [R] the first and last observation for each subject
> 
> >> library(plyr)
> >>
> >> # ddply is for splitting up data frames and combining the results
> >> # into a data frame.  .(ID) says to split up the data frame by the
> >> # subject variable
> >> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> >> ...
> >
> > The above is much quicker than the versions based on aggregate and
> 
> plyr does make some optimisations to increase speed and decrease
> memory usage (mainly by passing around lists of indices, rather than
> lists of the original objects) but it's unlikely ever to approach the
> speed of a pure vector approach (although I hope to put some time into
> rewriting the slow parts in C to do better with performance).
> 
> > easy to understand.  Another approach is more specialized but useful
> > when you have lots of ID's (e.g., millions) and speed is very
> > important.  It computes where the first and last entries for each ID
> > are in a vectorized computation, akin to the computation that rle()
> > uses:
> 
> I particularly like this solution to the problem - it's a very handy
> technique, and while it takes a while to get your head around how it
> works, it's worthwhile spending the time to do so because it crops up
> as a useful solution to many similar types of problems. (It can be
> particularly useful in Excel too, as a quick way of locating
> boundaries between groups)
> 
> Hadley
> 
> -- 
> http://had.co.nz/
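
For reference, here is a minimal sketch of that boundary-detection idea
applied to the original question (last minus first observation for each
subject).  The data frame DF and its columns ID and y are made up for
illustration, and the sketch assumes the rows are already ordered so
that each ID's rows are contiguous:

```r
# Hypothetical example data: the rows for each ID are contiguous
DF <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                 y  = c(10, 12, 15, 7, 4, 9))

# TRUE wherever the ID differs from the ID on the previous row
changes <- DF$ID[-1] != DF$ID[-nrow(DF)]

first <- which(c(TRUE, changes))   # row index of each ID's first entry
last  <- which(c(changes, TRUE))   # row index of each ID's last entry

# Last minus first observation for every ID, with no per-group loop
result <- DF$y[last] - DF$y[first]
names(result) <- DF$ID[first]
result
```

If the rows are not already grouped, sorting by ID first (for example
with order()) would be needed before computing the boundaries.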

The same technique can also be used to quickly compute medians by
group:

gm <- function(x, group){ # medians by group:
   # same result as sapply(split(x,group),median)
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
} 
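
To see why the floor/ceiling indexing in gm picks out the median, here
is a small walk-through of the intermediate values on toy data.  The
vectors and the names gs, xs, mid and med are made up for this
illustration; it is only a sketch of the steps, not a replacement for
the function above:

```r
# Hypothetical toy data: group "a" has 4 values, group "b" has 3
x     <- c(5, 1, 9, 3, 8, 2, 6)
group <- c("a", "a", "b", "a", "b", "b", "a")

o  <- order(group, x)    # sort by group, then by value within group
gs <- group[o]           # "a" "a" "a" "a" "b" "b" "b"
xs <- x[o]               # 1 3 5 6 2 8 9

changes <- gs[-1] != gs[-length(gs)]
first <- which(c(TRUE, changes))   # 1 5
last  <- which(c(changes, TRUE))   # 4 7

# Midpoint of each group's index range: fractional for even-sized
# groups (two middle values), whole for odd-sized groups (one middle)
mid <- (first + last) / 2          # 2.5 6.0

med <- (xs[floor(mid)] + xs[ceiling(mid)]) / 2
names(med) <- gs[first]
med                                # a = 4, b = 8
```

For group "a" the two middle values 3 and 5 are averaged; for group "b"
floor and ceiling agree, so the single middle value 8 is returned.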

For an x of length 10^5 and somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

> group<-sample(1:30000, size=100000, replace=TRUE)
> x<-rnorm(length(group))*10 + group
> unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed 
   2.72    0.00    3.20 
> unix.time(z1<-gm(x,group))
   user  system elapsed 
   0.12    0.00    0.16 
> identical(z1,z0)
[1] TRUE

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com