[R] Fastest way to repeatedly subset a data frame?
Tony Plate
tplate at acm.org
Fri Apr 20 23:20:53 CEST 2007
This type of information about speeds of various techniques can really
only be found out by trying things out, especially because R-core has
recently made a fair number of improvements to some of the underlying
code in R. That's part of the reason I put these tests together -- I
wanted to know for myself what sort of speed differences there were now
among the various approaches.
-- Tony Plate
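
For anyone who wants to try this out on their own installation, here is a minimal sketch of such a test. It is scaled down from the timings quoted below; the sizes n and k, the helper names by_name and by_match, and the use of sample(..., 1) to draw one size per id vector are assumptions made for illustration. Timings will depend on the R version and the machine, so treat the numbers as local observations only.

```r
n <- 2000   # rows (smaller than the 10000 used in the thread)
k <- 50     # number of id vectors (the thread used 500)

# Data frame with the ids stored as row names.
df <- data.frame(result = seq_len(n),
                 row.names = paste("ID", seq_len(n), sep = ""))

set.seed(1)
ids <- lapply(seq_len(k), function(i) {
  size <- sample(seq(ceiling(n / 1000), n / 2), 1)  # one random size
  paste("ID", sample(n, size = size), sep = "")
})

# Two ways of selecting the same rows (hypothetical helper names).
by_name  <- function(i) df[i, , drop = FALSE]                      # character subscripting
by_match <- function(i) df[match(i, rownames(df)), , drop = FALSE] # exact match first

# Check that both approaches pick the same rows in the same order.
same <- all(mapply(function(a, b) {
  identical(a$result, b$result) && identical(rownames(a), rownames(b))
}, lapply(ids, by_name), lapply(ids, by_match)))

# Time each approach over all id vectors.
t_name  <- system.time(lapply(ids, by_name))
t_match <- system.time(lapply(ids, by_match))
elapsed <- c(by_name = t_name[["elapsed"]], by_match = t_match[["elapsed"]])

same
elapsed
```

Checking that the two approaches return the same rows before comparing their times guards against benchmarking two things that are not actually equivalent.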
Iestyn Lewis wrote:
> This is fantastic. I just tested the first match() method and it is
> acceptably fast. I'll look into some of the even better methods
> later. Thank you for taking the time to put this together.
>
> Is this kind of optimization information on the web anywhere? I can
> imagine that a lot of people have slow sets of commands that could be
> optimized with this kind of knowledge.
>
> Thank you so much,
>
> Iestyn
>
> Tony Plate wrote:
>> Here are some timings for seemingly minor variations of data structure,
>> with run times ranging over a factor of 100 (factor of 3 if the worst
>> is omitted). One of the keys is to avoid use of the partial string
>> match that happens with ordinary data frame subscripting.
>>
>> -- Tony Plate
>>
>>> n <- 10000 # number of rows in data frame
>>> k <- 500 # number of vectors in indexing list
>>> # use a data frame with regular row names and id as factor
>>> # (defaults for data.frame)
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
>>      result=seq(len=n), stringsAsFactors=TRUE)
>>> object.size(df)
>> [1] 440648
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    3.00    0.00    3.03
>>> # use a data frame with automatic row names (should be low overhead)
>>> # and id as factor
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
>>      result=seq(len=n), row.names=NULL, stringsAsFactors=TRUE)
>>> object.size(df)
>> [1] 440648
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    2.68    0.00    2.70
>>> # use a data frame with automatic row names (should be low overhead)
>>> # and id as character
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
>>      result=seq(len=n), row.names=NULL, stringsAsFactors=FALSE)
>>> object.size(df)
>> [1] 400448
>>> df[1:3,,drop=FALSE]
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i, df$id),,drop=FALSE]))
>>    user  system elapsed
>>    1.54    0.00    1.59
>>> # use a data frame with ids as the row names & subscripting for
>>> # matching (should be high overhead)
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
>>      result=seq(len=n), row.names="id")
>>> object.size(df)
>> [1] 400384
>>> df[1:3,,drop=FALSE]
>>     result
>> ID1      1
>> ID2      2
>> ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[i,,drop=FALSE]))
>>    user  system elapsed
>>  109.15    0.04  111.28
>>> # use a data frame with ids as the row names & match()
>>> df <- data.frame(id=paste("ID", seq(len=n), sep=""),
>>      result=seq(len=n), row.names="id")
>>> object.size(df)
>> [1] 400384
>>> df[1:3,,drop=FALSE]
>>     result
>> ID1      1
>> ID2      2
>> ID3      3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) df[match(i,
>>      rownames(df)),,drop=FALSE]))
>>    user  system elapsed
>>    1.53    0.00    1.58
>>> # use a named numeric vector to store the same data as was stored in
>>> # the data frame
>>> x <- seq(len=n)
>>> names(x) <- paste("ID", seq(len=n), sep="")
>>> object.size(x)
>> [1] 400104
>>> x[1:3]
>> ID1 ID2 ID3
>>   1   2   3
>>> set.seed(1)
>>> ids <- lapply(seq(k), function(i) paste("ID", sample(n,
>>      size=sample(seq(ceiling(n/1000), n/2, 1))), sep=""))
>>> sum(sapply(ids, length))
>> [1] 1263508
>>> system.time(lapply(ids, function(i) x[match(i, names(x))]))
>>    user  system elapsed
>>    1.14    0.05    1.19
>>
>> Iestyn Lewis wrote:
>>> Good tip - an Rprof trace over my real data set resulted in a file
>>> filled with:
>>>
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> pmatch [.data.frame [ FUN lapply
>>> ...
>>> with very few other calls in there. pmatch seems to be the string
>>> search function, so I'm guessing there's no hashing going on, or not
>>> very good hashing.
>>>
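
A trace like the one quoted above can be collected with R's built-in sampling profiler, Rprof(), and summarised with summaryRprof() (both in the utils package). The sketch below is illustrative only: the data frame, the id vectors and the half-second workload loop are assumptions, not the poster's real data, and which functions dominate the summary will depend on the R version.

```r
# Illustrative data (an assumption; not the data set from the thread).
n <- 5000
df <- data.frame(result = seq_len(n),
                 row.names = paste("ID", seq_len(n), sep = ""))
set.seed(1)
ids <- lapply(1:20, function(i) paste("ID", sample(n, 500), sep = ""))

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                       # start sampling the call stack
t0 <- proc.time()[["elapsed"]]
# Repeat the workload for about half a second so the sampler
# (default interval 0.02 s) collects a useful number of samples.
while (proc.time()[["elapsed"]] - t0 < 0.5) {
  invisible(lapply(ids, function(i) df[i, , drop = FALSE]))
}
Rprof(NULL)                            # stop profiling

prof <- summaryRprof(prof_file)        # aggregate the raw samples
head(prof$by.total)                    # functions ranked by total time
unlink(prof_file)                      # remove the temporary trace file
```

The by.total table counts time spent in a function and everything it calls, so it is usually the quickest way to see whether the subscripting itself or something around it is the bottleneck.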
>>> I'll let you know how the environment option works - the Bioconductor
>>> project seems to make extensive use of it, so I'm guessing it's the
>>> way to go.
>>>
>>> Iestyn
>>>
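
For reference, the environment option mentioned above can be sketched roughly as follows. An environment created with hash = TRUE acts as a hashed name-to-value table. The names lookup, ids_all and wanted are hypothetical, and whether this beats match() on a given data set is something to measure, not assume.

```r
n <- 10000
ids_all <- paste("ID", seq_len(n), sep = "")
result  <- seq_len(n)

# Build the lookup table: one binding per id in a hashed environment.
lookup <- new.env(hash = TRUE, size = n)
for (j in seq_len(n)) assign(ids_all[j], result[j], envir = lookup)

# Retrieve several values at once; mget() returns a named list,
# which unlist() flattens to a named integer vector.
wanted <- c("ID42", "ID7", "ID9999")
vals <- unlist(mget(wanted, envir = lookup))
vals

# Test for a key without raising an error.
exists("ID0", envir = lookup, inherits = FALSE)
```

Unlike data frame subscripting, this stores one value per key, so it suits lookups of single results more than extraction of whole rows.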
>>> hadley wickham wrote:
>>>>> But... it's not any faster, which is worrisome to me because it seems
>>>>> like your code uses rownames and would take advantage of the hashing
>>>>> potential of named items.
>>>> I'm pretty sure it will use a hash to access the specified rows.
>>>> Before you pursue an environment based solution, you might want to
>>>> profile the code to check that the hashing is actually the slowest
>>>> part - I suspect creating all new data.frames is taking the most time.
>>>>
>>>> Hadley
>>> ______________________________________________
>>> R-help at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>