[R] Faster Subsetting

Wed Sep 28 18:28:56 CEST 2016

Thank you very much. I don’t know tidyverse, so I’ll look at that now. I did some tests with the data.table package, but it was much slower on my machine; see the examples below.

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

system.time(replicate(500, subset(tmp, id == idList[1])))

library(data.table)

tmp2 <- as.data.table(tmp)     # data.table

system.time(replicate(500, tmp2[which(tmp2$id == idList[1]),]))

system.time(replicate(500, subset(tmp2, id == idList[1])))
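The two data.table timings above still build a full logical vector over id on every call, so they do not use data.table's indexing at all. A minimal sketch of a keyed lookup follows, assuming data.table is installed and that sorting the table by id is acceptable. The names one_id and res are illustrative only, and this is one possible approach rather than a measured result for the 13-million-row case.

```r
library(data.table)

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

tmp2 <- as.data.table(tmp)
setkey(tmp2, id)   # sorts tmp2 by id and marks id as the key

# Keyed lookup: a binary search on the sorted key,
# instead of comparing every row of id against the target value
one_id <- tmp2[.(idList[1])]

# If every id is processed in turn, grouping with `by` avoids
# subsetting in a loop; mean() stands in for the real per-id work
res <- tmp2[, .(mean_foo = mean(foo)), by = id]
```

On a table this small the cost of setkey() and the per-call overhead of `[.data.table` may dominate, so any gain would be expected mainly on large inputs; timing it on the real data is the only way to know.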

From: Dominik Schneider [mailto:dosc3612 at colorado.edu]
Sent: Wednesday, September 28, 2016 12:27 PM
To: Doran, Harold <HDoran at air.org>
Cc: r-help at r-project.org
Subject: Re: [R] Faster Subsetting

I regularly crunch through this amount of data with tidyverse. You can also try the data.table package. They are optimized for speed, as long as you have the memory.
Dominik

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold <HDoran at air.org> wrote:
I have an extremely large data frame (~13 million rows) that resembles the structure of the object tmp below in the reproducible code. In my real data, the variable 'id' may or may not be ordered, but I think that is irrelevant.

I have a process that requires subsetting the data by id and then running each smaller data frame through a set of functions. One example below uses indexing and the other uses an explicit call to subset(). Both return the same result, but indexing is faster.

The problem is that in my real data, indexing must scan millions of rows to evaluate the condition, which is expensive and a bottleneck in my code. Can anyone recommend an approach that would be less expensive and faster?

Thank you
Harold

tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))
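Both calls above rescan the whole id column for every id. One base-R alternative, sketched here under the assumption that each id's rows are processed independently, is to partition the row indices once with split() and then index by position. The function summarise_one is a hypothetical stand-in for the real per-id processing, and rowsById and results are illustrative names.

```r
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# Partition the row numbers by id in a single pass over the id column
rowsById <- split(seq_len(nrow(tmp)), tmp$id)

# Hypothetical stand-in for the real set of functions applied to each subset
summarise_one <- function(d) mean(d$foo)

# Each subset is now extracted by integer position; no per-id comparison
results <- vapply(
  rowsById,
  function(rows) summarise_one(tmp[rows, , drop = FALSE]),
  numeric(1)
)
```

The result is a named vector with one element per id, where the names are the id values as character strings. Whether this is fast enough for the real data would need to be timed there.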

