[R] the first and last observation for each subject

Tue Jan 6 18:06:32 CET 2009

Just in case anyone is still interested, here are some
comparisons of the time it says to compute grouped medians
via sapply(split(x,group),median) and gm(x,group), which
uses the trick used by rle() to find the first and last
entries in each group.

Which method is fastest depends on the nature of your data:
the number of groups, k, the length of the data vector, n,
and probably the distribution of group size (which I didn't
look into).

If the overhead involved in calling a function were to go down
we would expect sapply() to look better in the cases where k/n
is close to 1.
   Bill

From: William Dunlap
Sent: Monday, January 05, 2009 4:01 PM
To: 'Kingsford Jones'
Subject: RE: [R] the first and last observation for each subject

> -----Original Message-----
> From: Kingsford Jones [mailto:kingsfordjones at gmail.com] 
> Sent: Monday, January 05, 2009 3:19 PM
> To: William Dunlap
> Subject: Re: [R] the first and last observation for each subject
> 
> Thanks for the explanation -- it's much more satisfying to have some
> theory rather than 'just trying things out'...
> 
> Kingsford

Trying things out helps sort out the constants in those 'order of'
assertions.  Here are times for various n and k (both ranging over
4^(3:10), with k<=n) for gm and sapply:

> time.gm
         k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.00   NA   NA   NA    NA    NA     NA      NA
  256     0.00 0.00   NA   NA    NA    NA     NA      NA
  1024    0.00 0.00 0.00   NA    NA    NA     NA      NA
  4096    0.00 0.00 0.00 0.00    NA    NA     NA      NA
  16384   0.00 0.01 0.02 0.02  0.04    NA     NA      NA
  65536   0.08 0.06 0.07 0.06  0.05  0.09     NA      NA
  262144  0.36 0.36 0.33 0.34  0.28  0.26   0.42      NA
  1048576 3.46 2.83 1.88 1.91  1.96  1.39   1.17     2.2
> time.sapply
         k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.01   NA   NA   NA    NA    NA     NA      NA
  256     0.01 0.03   NA   NA    NA    NA     NA      NA
  1024    0.02 0.03 0.09   NA    NA    NA     NA      NA
  4096    0.00 0.03 0.14 0.41    NA    NA     NA      NA
  16384   0.02 0.05 0.14 0.55  1.61    NA     NA      NA
  65536   0.02 0.03 0.14 0.55  2.22  6.54     NA      NA
  262144  0.03 0.05 0.17 0.60  2.28  8.93  27.44      NA
  1048576 0.27 0.16 0.27 0.71  2.41  9.22  37.56  121.55
> time.gm<time.sapply
         k
n            64   256  1024  4096 16384 65536 262144 1048576
  64       TRUE    NA    NA    NA    NA    NA     NA      NA
  256      TRUE  TRUE    NA    NA    NA    NA     NA      NA
  1024     TRUE  TRUE  TRUE    NA    NA    NA     NA      NA
  4096    FALSE  TRUE  TRUE  TRUE    NA    NA     NA      NA
  16384    TRUE  TRUE  TRUE  TRUE  TRUE    NA     NA      NA
  65536   FALSE FALSE  TRUE  TRUE  TRUE  TRUE     NA      NA
  262144  FALSE FALSE FALSE  TRUE  TRUE  TRUE   TRUE      NA
  1048576 FALSE FALSE FALSE FALSE  TRUE  TRUE   TRUE    TRUE
> round(time.gm/time.sapply,3)
         k
n             64    256  1024  4096 16384 65536 262144 1048576
  64       0.000     NA    NA    NA    NA    NA     NA      NA
  256      0.000  0.000    NA    NA    NA    NA     NA      NA
  1024     0.000  0.000 0.000    NA    NA    NA     NA      NA
  4096       NaN  0.000 0.000 0.000    NA    NA     NA      NA
  16384    0.000  0.200 0.143 0.036 0.025    NA     NA      NA
  65536    4.000  2.000 0.500 0.109 0.023 0.014     NA      NA
  262144  12.000  7.200 1.941 0.567 0.123 0.029  0.015      NA
  1048576 12.815 17.688 6.963 2.690 0.813 0.151  0.031   0.018
> 
> On Mon, Jan 5, 2009 at 4:00 PM, William Dunlap 
> <wdunlap at tibco.com> wrote:
> > Note that gm fully sorts the whole vector, taking order n*log(n)
> > time, then does an order n subscripting operation, then
> > some order k stuff.
> >
> > The sapply approach does k calls to median, each of which
> > takes order n/k time to partially sort its short input (assuming
> > all groups are about the same size), resulting in order n total
> > sorting time (I think the partial sort is order n, if not it is
> > pretty close to that for n up to 2^25).
> > It also involves some other order n and order k time operations.
> >
> > Thus for a fixed k, as n grows the sapply approach ought to
> > do better, but for a fixed n, as k grows gm ought to do better.
> > gm() only does better in some cases (when k/n is close to 1)
> > because it avoids the function call overhead involved in calling
> > median() many times.
> >
> > Bill Dunlap
> > TIBCO Software Inc - Spotfire Division
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: Kingsford Jones [mailto:kingsfordjones at gmail.com]
> >> Sent: Monday, January 05, 2009 2:20 PM
> >> To: William Dunlap
> >> Cc: hadley wickham; R help
> >> Subject: Re: [R] the first and last observation for each subject
> >>
> >> whoops -- I left the group size unchanged so k became greather than
> >> the length of the group vector.  When I increase the size to 1e7,
> >> sapply is faster until it gets to k = 1e6.
> >>
> >> warning:  this takes awhile (particularly on my machine 
> which seems to
> >> be using just 1 of it's 2 cpus)
> >>
> >> >  for(k in 10^(1:6)){
> >> +   group<-sample(1:k, size=1e7, replace=TRUE)
> >> +   x<-rnorm(length(group))*10 + group
> >> +   cat('number of groups:', k, '\n')
> >> +   cat('sapply\n')
> >> +   print(s <- unix.time(sapply(split(x,group), median)))
> >> +   cat('gm\n')
> >> +   print(g <- unix.time(-gm(x,group)))
> >> +   cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
> >> +  }
> >> number of groups: 10
> >> sapply
> >>    user  system elapsed
> >>    1.18    0.38    1.56
> >> gm
> >>    user  system elapsed
> >>   53.43    0.70   55.05
> >>  sapply/gm user ratio: 0.02208497
> >>
> >>
> >> number of groups: 100
> >> sapply
> >>    user  system elapsed
> >>    1.17    0.23    1.42
> >> gm
> >>    user  system elapsed
> >>   49.89    0.61   51.60
> >>  sapply/gm user ratio: 0.02345159
> >>
> >>
> >> number of groups: 1000
> >> sapply
> >>    user  system elapsed
> >>    1.29    0.25    1.72
> >> gm
> >>    user  system elapsed
> >>   45.23    0.34   46.55
> >>  sapply/gm user ratio: 0.02852089
> >>
> >>
> >> number of groups: 10000
> >> sapply
> >>    user  system elapsed
> >>    2.80    0.09    2.87
> >> gm
> >>    user  system elapsed
> >>   41.04    0.55   42.85
> >>  sapply/gm user ratio: 0.06822612
> >>
> >>
> >> number of groups: 1e+05
> >> sapply
> >>    user  system elapsed
> >>   14.65    0.16   15.18
> >> gm
> >>    user  system elapsed
> >>   38.28    0.65   39.55
> >>  sapply/gm user ratio: 0.3827064
> >>
> >>
> >> number of groups: 1e+06
> >> sapply
> >>    user  system elapsed
> >>  126.63    0.42  129.21
> >> gm
> >>    user  system elapsed
> >>   37.56    0.84   38.78
> >>  sapply/gm user ratio: 3.371406
> >>
> >> On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
> >> <kingsfordjones at gmail.com> wrote:
> >> > Here's some more timing's of Bill's function.  Although in this
> >> > example sapply has a clear performance advantage for 
> smaller numbers
> >> > of groups (k) , gm is substantially faster for k >> 1000:
> >> >
> >> >  gm <- function(x, group){ # medians by group:
> >> >   o<-order(group, x)
> >> >   group <- group[o]
> >> >   x <- x[o]
> >> >   changes <- group[-1] != group[-length(group)]
> >> >   first <- which(c(TRUE, changes))
> >> >   last <- which(c(changes, TRUE))
> >> >   lowerMedian <- x[floor((first+last)/2)]
> >> >   upperMedian <- x[ceiling((first+last)/2)]
> >> >   median <- (lowerMedian+upperMedian)/2
> >> >   names(median) <- group[first]
> >> >   median
> >> >  }
> >> >
> >> >  ##
> >> >
> >> >  for(k in 10^(1:6)){
> >> >  group<-sample(1:k, size=100000, replace=TRUE)
> >> >  x<-rnorm(length(group))*10 + group
> >> >  cat('number of groups:', k, '\n')
> >> >  cat('sapply\n')
> >> >  print(s <- unix.time(sapply(split(x,group), median)))
> >> >  cat('gm\n')
> >> >  print(g <- unix.time(-gm(x,group)))
> >> >  cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
> >> >  }
> >> >
> >> > number of groups: 10
> >> > sapply
> >> >   user  system elapsed
> >> >   0.01    0.00    0.01
> >> > gm
> >> >   user  system elapsed
> >> >   0.14    0.00    0.16
> >> >  sapply/gm user ratio: 0.07142857
> >> >
> >> >
> >> > number of groups: 100
> >> > sapply
> >> >   user  system elapsed
> >> >   0.02    0.00    0.02
> >> > gm
> >> >   user  system elapsed
> >> >   0.11    0.00    0.09
> >> >  sapply/gm user ratio: 0.1818182
> >> >
> >> >
> >> > number of groups: 1000
> >> > sapply
> >> >   user  system elapsed
> >> >   0.11    0.00    0.11
> >> > gm
> >> >   user  system elapsed
> >> >   0.11    0.00    0.11
> >> >  sapply/gm user ratio: 1
> >> >
> >> >
> >> > number of groups: 10000
> >> > sapply
> >> >   user  system elapsed
> >> >   1.00    0.00    1.04
> >> > gm
> >> >   user  system elapsed
> >> >   0.10    0.00    0.09
> >> >  sapply/gm user ratio: 10
> >> >
> >> >
> >> > number of groups: 1e+05
> >> > sapply
> >> >   user  system elapsed
> >> >   6.24    0.01    6.31
> >> > gm
> >> >   user  system elapsed
> >> >   0.16    0.00    0.17
> >> >  sapply/gm user ratio: 39
> >> >
> >> >
> >> > number of groups: 1e+06
> >> > sapply
> >> >   user  system elapsed
> >> >   9.00    0.03    8.92
> >> > gm
> >> >   user  system elapsed
> >> >   0.15    0.00    0.20
> >> >  sapply/gm user ratio: 60
> >> >
> >> >
> >> >> R.Version()
> >> > $platform
> >> > [1] "i386-pc-mingw32"
> >> >
> >> > $arch
> >> > [1] "i386"
> >> >
> >> > $os
> >> > [1] "mingw32"
> >> >
> >> > $system
> >> > [1] "i386, mingw32"
> >> >
> >> > $status
> >> > [1] ""
> >> >
> >> > $major
> >> > [1] "2"
> >> >
> >> > $minor
> >> > [1] "8.0"
> >> >
> >> > $year
> >> > [1] "2008"
> >> >
> >> > $month
> >> > [1] "10"
> >> >
> >> > $day
> >> > [1] "20"
> >> >
> >> > $`svn rev`
> >> > [1] "46754"
> >> >
> >> > $language
> >> > [1] "R"
> >> >
> >> > $version.string
> >> > [1] "R version 2.8.0 (2008-10-20)"
> >> >
> >> >
> >> > Kingsford Jones
> >> >
> >> >
> >> >
> >> > On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap
> >> <wdunlap at tibco.com> wrote:
> >> >> Arg, the 'sapply(...)' in the function was in the initial
> >> >> comment,
> >> >>   gm <- function(x, group){ # medians by group:
> >> >> sapply(split(x,group),median)
> >> >> but someone's mailer put a newline before the sapply
> >> >>   gm <- function(x, group){ # medians by group:
> >> >>   sapply(split(x,group),median)
> >> >> so it got executed, wasting all that time.  (Of course the
> >> >> same mailer will mess up this message in the same way -
> >> >> just delete the sapply() call from gm().)
> >> >>
> >> >> Bill Dunlap
> >> >> TIBCO Software Inc - Spotfire Division
> >> >> wdunlap tibco.com
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: hadley wickham [mailto:h.wickham at gmail.com]
> >> >>> Sent: Monday, January 05, 2009 9:10 AM
> >> >>> To: William Dunlap
> >> >>> Cc: gallon.li at gmail.com; R help
> >> >>> Subject: Re: [R] the first and last observation for 
> each subject
> >> >>>
> >> >>> > Another application of that technique can be used to
> >> quickly compute
> >> >>> > medians by groups:
> >> >>> >
> >> >>> > gm <- function(x, group){ # medians by group:
> >> >>> > sapply(split(x,group),median)
> >> >>> >   o<-order(group, x)
> >> >>> >   group <- group[o]
> >> >>> >   x <- x[o]
> >> >>> >   changes <- group[-1] != group[-length(group)]
> >> >>> >   first <- which(c(TRUE, changes))
> >> >>> >   last <- which(c(changes, TRUE))
> >> >>> >   lowerMedian <- x[floor((first+last)/2)]
> >> >>> >   upperMedian <- x[ceiling((first+last)/2)]
> >> >>> >   median <- (lowerMedian+upperMedian)/2
> >> >>> >   names(median) <- group[first]
> >> >>> >   median
> >> >>> > }
> >> >>> >
> >> >>> > For a 10^5 long x and a somewhat fewer than 3*10^4
> >> distinct groups
> >> >>> > (in random order) the times are:
> >> >>> >
> >> >>> >> group<-sample(1:30000, size=100000, replace=TRUE)
> >> >>> >> x<-rnorm(length(group))*10 + group
> >> >>> >> unix.time(z0<-sapply(split(x,group), median))
> >> >>> >   user  system elapsed
> >> >>> >   2.72    0.00    3.20
> >> >>> >> unix.time(z1<-gm(x,group))
> >> >>> >   user  system elapsed
> >> >>> >   0.12    0.00    0.16
> >> >>> >> identical(z1,z0)
> >> >>> > [1] TRUE
> >> >>>
> >> >>> I get:
> >> >>>
> >> >>> > unix.time(z0<-sapply(split(x,group), median))
> >> >>>    user  system elapsed
> >> >>>   2.733   0.017   2.766
> >> >>> > unix.time(z1<-gm(x,group))
> >> >>>    user  system elapsed
> >> >>>   2.897   0.032   2.946
> >> >>>
> >> >>>
> >> >>> Hadley
> >> >>>
> >> >>>
> >> >>> --
> >> >>> http://had.co.nz/
> >> >>>
> >> >>
> >> >> ______________________________________________
> >> >> R-help at r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> >> and provide commented, minimal, self-contained, 
> reproducible code.
> >> >>
> >> >
> >>
> >
>