[R] Fastest way to repeatedly subset a data frame?
Tony Plate
tplate at acm.org
Fri Apr 20 22:33:57 CEST 2007
Here are some timings for seemingly minor variations of data structure,
with times ranging by a factor of 100 (a factor of 3 if the worst case is
omitted). One of the keys is to avoid the partial string matching that
happens with ordinary data-frame row subscripting.
-- Tony Plate
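To illustrate the partial-matching point: match() does an exact (hashed)
lookup, whereas pmatch() scans for partial matches and must also check for
ambiguity. This small demonstration is mine, not from the original post:

```r
tbl <- c("mean", "median", "mode")

match("mea", tbl)   # NA: match() requires an exact hit
pmatch("mea", tbl)  # 1:  "mea" partially matches only "mean"
pmatch("me", tbl)   # NA: ambiguous ("mean" and "median" both match)
pmatch("mode", tbl) # 3:  exact matches always win
```

The ambiguity check is what makes pmatch() expensive at scale: every query
string is compared against every table entry.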
> n <- 10000 # number of rows in data frame
> k <- 500 # number of vectors in indexing list
> # use a data frame with regular row names and id as factor
> # (defaults for data.frame)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
3.00 0.00 3.03
>
> # use a data frame with automatic row names (should be low overhead)
> # and id as factor
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
2.68 0.00 2.70
>
> # use a data frame with automatic row names (should be low overhead)
> # and id as character
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
> object.size(df)
[1] 400448
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
1.54 0.00 1.59
>
> # use a data frame with ids as the row names & subscripting for
> # matching (should be high overhead)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
result
ID1 1
ID2 2
ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
user system elapsed
109.15 0.04 111.28
>
> # use a data frame with ids as the row names & match()
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
result
ID1 1
ID2 2
ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i,
+     rownames(df)),,drop=FALSE]))
user system elapsed
1.53 0.00 1.58
>
> # use a named numeric vector to store the same data as was stored
> # in the data frame
> x <- seq(len=n)
> names(x) <- paste("ID", seq(len=n), sep="")
> object.size(x)
[1] 400104
> x[1:3]
ID1 ID2 ID3
1 2 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) x[match(i, names(x))]))
user system elapsed
1.14 0.05 1.19
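The "environment option" discussed in the quoted messages below can be
sketched roughly as follows. This is my own illustration, not code from the
thread: a hashed environment gives constant-time lookup by name, at the
cost of storing one binding per id.

```r
# Build a hashed environment mapping "ID<i>" -> i (assumed layout,
# mirroring the named-vector example above).
n <- 10000
e <- new.env(hash = TRUE, size = n)
ids <- paste("ID", seq_len(n), sep = "")
for (i in seq_len(n)) assign(ids[i], i, envir = e)

# mget() retrieves a whole batch of names in one call.
unlist(mget(c("ID1", "ID2", "ID3"), envir = e))
```

Each lookup is a hash probe rather than a scan, so batch retrieval should
scale with the number of requested ids, not with n; whether it beats the
named-vector/match() approach here would need to be measured.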
>
Iestyn Lewis wrote:
> Good tip - an Rprof trace over my real data set resulted in a file
> filled with:
>
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> ...
> with very few other calls in there. pmatch seems to be the string
> search function, so I'm guessing there's no hashing going on, or not
> very good hashing.
>
> I'll let you know how the environment option works - the Bioconductor
> project seems to make extensive use of it, so I'm guessing it's the way
> to go.
>
> Iestyn
>
> hadley wickham wrote:
>>> But... it's not any faster, which is worrisome to me because it seems
>>> like your code uses rownames and would take advantage of the hashing
>>> potential of named items.
>> I'm pretty sure it will use a hash to access the specified rows.
>> Before you pursue an environment based solution, you might want to
>> profile the code to check that the hashing is actually the slowest
>> part - I suspect creating all new data.frames is taking the most time.
>>
>> Hadley
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>