[R] Fastest way to repeatedly subset a data frame?
Tony Plate
tplate at acm.org
Fri Apr 20 22:33:57 CEST 2007
Here are some timings for seemingly minor variations of data structure,
with times ranging by a factor of 100 (a factor of 3 if the worst case is
omitted). One of the keys is to avoid the partial string matching that
happens with ordinary data-frame row subscripting.
-- Tony Plate
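To illustrate the partial-matching point: match() does an exact (hashed)
lookup, whereas pmatch() scans for partial matches and must also check for
ambiguity. This small demonstration is mine, not from the original post:

```r
tbl <- c("mean", "median", "mode")

match("mea", tbl)   # NA: match() requires an exact hit
pmatch("mea", tbl)  # 1:  "mea" partially matches only "mean"
pmatch("me", tbl)   # NA: ambiguous ("mean" and "median" both match)
pmatch("mode", tbl) # 3:  exact matches always win
```

The ambiguity check is what makes pmatch() expensive at scale: every query
string is compared against every table entry.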
> n <- 10000 # number of rows in data frame
> k <- 500 # number of vectors in indexing list
> # use a data frame with regular row names and id as factor
> # (defaults for data.frame)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
3.00 0.00 3.03
>
> # use a data frame with automatic row names (should be low overhead)
> # and id as factor
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
> object.size(df)
[1] 440648
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
2.68 0.00 2.70
>
> # use a data frame with automatic row names (should be low overhead)
> # and id as character
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
> object.size(df)
[1] 400448
> df[1:3,,drop=FALSE]
id result
1 ID1 1
2 ID2 2
3 ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
user system elapsed
1.54 0.00 1.59
>
> # use a data frame with ids as the row names & subscripting for
> # matching (should be high overhead)
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
result
ID1 1
ID2 2
ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
user system elapsed
109.15 0.04 111.28
>
> # use a data frame with ids as the row names & match()
> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
+     result=seq(len=n), row.names="id")
> object.size(df)
[1] 400384
> df[1:3,,drop=FALSE]
result
ID1 1
ID2 2
ID3 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) df[match(i,
+     rownames(df)),,drop=FALSE]))
user system elapsed
1.53 0.00 1.58
>
> # use a named numeric vector to store the same data as was stored
> # in the data frame
> x <- seq(len=n)
> names(x) <- paste("ID", seq(len=n), sep="")
> object.size(x)
[1] 400104
> x[1:3]
ID1 ID2 ID3
1 2 3
> set.seed(1)
> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
+     size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
> sum(sapply(ids, length))
[1] 1263508
> system.time(lapply(ids, function(i) x[match(i, names(x))]))
user system elapsed
1.14 0.05 1.19
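The "environment option" discussed in the quoted messages below can be
sketched roughly as follows. This is my own illustration, not code from the
thread: a hashed environment gives constant-time lookup by name, at the
cost of storing one binding per id.

```r
# Build a hashed environment mapping "ID<i>" -> i (assumed layout,
# mirroring the named-vector example above).
n <- 10000
e <- new.env(hash = TRUE, size = n)
ids <- paste("ID", seq_len(n), sep = "")
for (i in seq_len(n)) assign(ids[i], i, envir = e)

# mget() retrieves a whole batch of names in one call.
unlist(mget(c("ID1", "ID2", "ID3"), envir = e))
```

Each lookup is a hash probe rather than a scan, so batch retrieval should
scale with the number of requested ids, not with n; whether it beats the
named-vector/match() approach here would need to be measured.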
>
Iestyn Lewis wrote:
> Good tip - an Rprof trace over my real data set resulted in a file
> filled with:
>
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> pmatch [.data.frame [ FUN lapply
> ...
> with very few other calls in there. pmatch seems to be the string
> search function, so I'm guessing there's no hashing going on, or not
> very good hashing.
>
> I'll let you know how the environment option works - the Bioconductor
> project seems to make extensive use of it, so I'm guessing it's the way
> to go.
>
> Iestyn
>
> hadley wickham wrote:
>>> But... it's not any faster, which is worrisome to me because it seems
>>> like your code uses rownames and would take advantage of the hashing
>>> potential of named items.
>> I'm pretty sure it will use a hash to access the specified rows.
>> Before you pursue an environment based solution, you might want to
>> profile the code to check that the hashing is actually the slowest
>> part - I suspect creating all new data.frames is taking the most time.
>>
>> Hadley
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>