[R] long format - find age when another variable is first 'high'

Tue May 26 18:42:29 CEST 2009

> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Marc Schwartz
> Sent: Monday, May 25, 2009 6:52 AM
> To: David Freedman
> Cc: r-help at r-project.org
> Subject: Re: [R] long format - find age when another variable is first 'high'
> 
> 
> On May 25, 2009, at 7:45 AM, David Freedman wrote:
> 
> >
> > Dear R,
> >
> > I've got a data frame with children examined multiple times and at various
> > ages.  I'm trying to find the first age at which another variable
> > (LDL-Cholesterol) is >= 130 mg/dL; for some children, this may never happen.
> > I can do this with transformBy and ddply, but with 10,000 different
> > children, these functions take some time on my PCs - is there a faster way
> > to do this in R?  My code on a small dataset follows.
> >
> > Thanks very much, David Freedman
> >
> > d<-data.frame(id=c(rep(1,3),rep(2,2),3),
> >               age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
> > d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
> > d
> > library(plyr)
> > d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
> > library(doBy)
> > d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
> > d2
> 
> The first thing that I would do is to get rid of records that are not
> relevant to your question:
> 
>  > d
>   id age ldlc high.ldlc
> 1  1   5  132         1
> 2  1  10  120         0
> 3  1  15  125         0
> 4  2   4  105         0
> 5  2   7  142         1
> 6  3  12  160         1
> 
> 
> # Get records with high ldl
> d.new <- subset(d, ldlc >= 130)
> 
> 
>  > d.new
>   id age ldlc high.ldlc
> 1  1   5  132         1
> 5  2   7  142         1
> 6  3  12  160         1
> 
> 
> That will help to reduce the total size of the dataset, perhaps
> substantially. It will also remove entire subjects that are not
> relevant (e.g., those who never have LDL >= 130).
> 
> Then get the minimum age for each of the remaining subjects:
> 
>  > aggregate(d.new$age, list(id = d.new$id), min)
>   id  x
> 1  1  5
> 2  2  7
> 3  3 12
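
The subset-then-aggregate steps quoted above drop children who never
reach 130 mg/dL. If those children should appear in the result with a
missing age, one possible sketch is to merge the aggregated ages back
onto the full list of ids. The extra child (id 4) and the column name
'minage' are assumptions added for illustration; they are not in the
original post.

```r
# Sample data from the original post
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

# Hypothetical extra child (id 4) whose LDL never reaches 130 mg/dL
d <- rbind(d, data.frame(id = 4, age = c(6, 9), ldlc = c(110, 118)))

# Keep only the high-LDL records, then take the minimum age per child
d.new  <- subset(d, ldlc >= 130)
minage <- aggregate(d.new$age, list(id = d.new$id), min)
names(minage)[2] <- "minage"   # illustrative column name

# Left join onto every id; children with no high reading get NA
res <- merge(data.frame(id = unique(d$id)), minage, by = "id", all.x = TRUE)
res
```

With this sample, ids 1 to 3 keep the ages found above and id 4 gets NA.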

If the dataset has a lot of rows, you can save more time
by replacing the call to aggregate() with code that sorts
the filtered data by 'id', breaking ties by 'age', and
then picks out the first row plus each row just after a
change in the value of 'id':
    f <- function(d) {
         dSorted <- d[ order(d$id,d$age),]
         n <- length(d$id) # or nrow(d)
         dSorted[   c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
    }
    f(d.new) # or f(d[d$ldlc>=130,]) to avoid leaving the temp variable around
If you know your dataset is already sorted in this way, you
need only the last line of that function.
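
A compact variant of the same sort-and-pick idea, offered as a sketch
rather than as a replacement for f() above, does the filtering inside
the function and uses duplicated() to keep the first row per id. The
names 'firstHigh' and 'cutoff' are made up for this example.

```r
# Hypothetical helper: first (youngest) record per child with LDL >= cutoff
firstHigh <- function(d, cutoff = 130) {
    high <- d[d$ldlc >= cutoff, ]                   # drop irrelevant records
    highSorted <- high[order(high$id, high$age), ]  # sort by id, then age
    highSorted[!duplicated(highSorted$id), ]        # first row of each id
}

# Sample data from the original post
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))

firstHigh(d)
```

This version also returns an empty data frame when no record meets the
cutoff, e.g. firstHigh(d, cutoff = 1000).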

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 

> 
> Try that to see what sort of time reduction you observe.
> 
> HTH,
> 
> Marc Schwartz
> 