[R] the first and last observation for each subject
William Dunlap
wdunlap at tibco.com
Mon Jan 5 18:02:45 CET 2009
> -----Original Message-----
> From: hadley wickham [mailto:h.wickham at gmail.com]
> Sent: Sunday, January 04, 2009 8:56 PM
> To: William Dunlap
> Cc: gallon.li at gmail.com; R help
> Subject: Re: [R] the first and last observation for each subject
>
> >> library(plyr)
> >>
> >> # ddply is for splitting up data frames and combining the results
> >> # into a data frame. .(ID) says to split up the data frame by the
> >> # subject variable
> >> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> >> ...
> >
> > The above is much quicker than the versions based on aggregate and
>
> plyr does make some optimisations to increase speed and decrease
> memory usage (mainly by passing around lists of indices, rather than
> lists of the original objects) but it's unlikely ever to approach the
> speed of a pure vector approach (although I hope to put some time into
> rewriting the slow parts in C to do better with performance).
>
> > easy to understand. Another approach is more specialized but useful
> > when you have lots of ID's (e.g., millions) and speed is very
> > important. It computes where the first and last entries for each ID
> > are in a vectorized computation, akin to the computation that rle()
> > uses:
>
> I particularly like this solution to the problem - it's a very handy
> technique, and while it takes a while to get your head around how it
> works, it's worthwhile spending the time to do so because it crops up
> as a useful solution to many similar types of problems. (It can be
> particularly useful in Excel too, as a quick way of locating
> boundaries between groups)
>
> Hadley
>
> --
> http://had.co.nz/
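As a reminder of the technique under discussion, here is a minimal sketch
applied to the original first-and-last problem. The toy data are
hypothetical; DF, ID and y follow the names in the quoted ddply call, and
the sketch assumes the rows are already sorted so that each ID forms one
contiguous run.

```r
# Hypothetical toy data (not from the thread)
DF <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                 y  = c(5, 7, 9, 2, 6, 4))

# TRUE wherever the ID differs from the ID on the previous row
changes <- DF$ID[-1] != DF$ID[-nrow(DF)]
first <- which(c(TRUE, changes))   # row where each ID's run starts
last  <- which(c(changes, TRUE))   # row where each ID's run ends

# last observation minus first observation, one value per ID
diffs <- DF$y[last] - DF$y[first]
names(diffs) <- DF$ID[first]
diffs
```

If the IDs are not contiguous, sorting with order(DF$ID) first (as gm()
below does with its own inputs) should restore that assumption.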
The same technique can also be used to quickly compute medians by group:
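gm <- function(x, group){ # medians by group: sapply(split(x,group),median)
   # sort by group, then by x within each group
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   # TRUE wherever the group changes between adjacent sorted entries
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))   # index of the first entry of each group
   last <- which(c(changes, TRUE))    # index of the last entry of each group
   # the two middle entries coincide when a group has an odd number of entries
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
}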
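To see why the floor/ceiling arithmetic in gm() gives the median, it may
help to trace it by hand on a tiny input. The values below are hypothetical
and already sorted by group and then by x, which is the state gm() reaches
after its order() step; "a" has an odd number of entries and "b" an even
number.

```r
# Hypothetical pre-sorted input: group "a" has 3 entries, group "b" has 4
x     <- c(1, 5, 9, 2, 4, 7, 8)
group <- c("a", "a", "a", "b", "b", "b", "b")

changes <- group[-1] != group[-length(group)]
first <- which(c(TRUE, changes))   # start index of each group
last  <- which(c(changes, TRUE))   # end index of each group

mid   <- (first + last) / 2        # whole number for odd-sized groups, x.5 for even
lower <- x[floor(mid)]             # lower middle entry of each group
upper <- x[ceiling(mid)]           # upper middle entry (same entry when size is odd)
med   <- (lower + upper) / 2
names(med) <- group[first]
med
```

For an odd-sized group the midpoint index is a whole number, so the lower
and upper entries are the same element; for an even-sized group they are the
two central elements, whose average is the usual median.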
For an x of length 10^5 and somewhat fewer than 3*10^4 distinct groups
(in random order), the times are:
> group<-sample(1:30000, size=100000, replace=TRUE)
> x<-rnorm(length(group))*10 + group
> unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
   2.72    0.00    3.20
> unix.time(z1<-gm(x,group))
   user  system elapsed
   0.12    0.00    0.16
> identical(z1,z0)
[1] TRUE
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com