[R] Fastest way to repeatedly subset a data frame?

Fri Apr 20 21:07:09 CEST 2007

That is a seriously neat bit of code there.  (I'm new to non-loop-based 
programming, forgive my enthusiasm).

But... it's not any faster, which is worrisome to me because it seems 
like your code uses rownames and would take advantage of the hashing 
potential of named items.
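
For comparison, here is a minimal sketch that skips row names altogether and looks the ids up with match(), which is understood to hash its table argument internally. The names df, ids and subsets are only illustrative, and whether this is any faster on the full 62,000-row data would need to be timed.

```r
# Sample data mirroring the quoted message
df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3,
                 stringsAsFactors = FALSE)
ids <- list(List1 = c("ID1", "ID3"), List2 = "ID2")

# match() gives the row position of each requested id (NA if absent);
# indexing the plain result vector avoids data-frame row-name lookup.
subsets <- lapply(ids, function(id) df$result[match(id, df$id)])
```

Each element of subsets is then a plain vector of results, one per id in the corresponding list.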

I'm currently looking at converting the vectors of ids to lists.  I've 
also come across some pages that refer to setting up a new 
environment using the hash=TRUE argument, but it's unclear to me how 
you go about using that new environment.
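
A rough sketch of how such an environment might be used as a lookup table, under the assumption that every id is unique and is a valid key. The names lookup and subsets are made up for illustration. new.env(), assign() and mget() are base R functions.

```r
# Sample data mirroring the quoted message
df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3,
                 stringsAsFactors = FALSE)
ids <- list(List1 = c("ID1", "ID3"), List2 = "ID2")

# Create a hashed environment and store one result per id
lookup <- new.env(hash = TRUE)
for (i in seq_len(nrow(df))) {
    assign(df$id[i], df$result[i], envir = lookup)
}

# mget() fetches several keys at once and returns a named list;
# unlist() flattens that list into a named vector
subsets <- lapply(ids, function(id) unlist(mget(id, envir = lookup)))
```

As written, mget() stops with an error if an id is not in the environment, so ids missing from the data frame would need to be filtered out first or handled through its ifnotfound argument.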

Thanks,

Iestyn

hadley wickham wrote:
> On 4/20/07, Iestyn Lewis <ilewis at pharm.emory.edu> wrote:
>> Hi -
>>
>>  I have a data frame with a large number of observations (62,000 rows,
>> but only 2 columns - a character ID and a result list).
>>
>> Sample:
>>
>>  > my.df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
>>  > my.df
>>    id result
>> 1 ID1      1
>> 2 ID2      2
>> 3 ID3      3
>>
>> I have a list of ID vectors.  This list will have anywhere from 100 to
>> 1000 members, and each member will have anywhere from 10 to 5000 id 
>> entries.
>>
>> Sample:
>>
>>  > my.idlist[["List1"]] <- c("ID1", "ID3")
>>  > my.idlist[["List2"]] <- c("ID2")
>>  > my.idlist
>> $List1
>> [1] "ID1" "ID3"
>>
>> $List2
>> [1] "ID2"
>>
>>
>> I need to subset that data frame by the list of IDs in each vector, to
>> end up with vectors that contain just the results for the IDs found in
>> each vector in the list.  My current approach is to create new columns
>> in the original data frame with the names of the list items, and any
>> results that don't match replaced with NA.  Here is what I've done so 
>> far:
>>
>> createSubsets <- function(res, slib) {
>>     for(i in 1:length(slib)) {
>>         res[ ,names(slib)[i]] <- replace(res$result,
>> which(!is.element(res$id, slib[[i]])), NA)
>>         return (res)
>>     }
>> }
>>
>> I have 2 problems:
>>
>> 1)  My function only works for the first item in the list:
>>
>>  > my.df <- createSubsets(my.df, my.idlist)
>>  > my.df
>>    id result List1
>> 1 ID1      1     1
>> 2 ID2      2    NA
>> 3 ID3      3     3
>>
>> In order to get all results, I have to copy the loop out of the function
>> and paste it into R directly.
>>
>> 2)  It is very, very slow.  For a dataset of 62,000 rows and 253 list
>> entries, it probably takes 5 minutes on a Pentium D.  An implementation
>> of this kind of subsetting using hashtables in C# takes a negligible
>> amount of time.
>>
>> I am open to any suggestions about data format, methods, anything.
>
> How about:
>
> df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
>
> ids <- list()
> ids[["List1"]] <- c("ID1", "ID3")
> ids[["List2"]] <- c("ID2")
>
> rownames(df) <- df$id
> lapply(ids, function(id) df[id, ])
>
> Hadley
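
On the first problem in the quoted message: return() sits inside the for loop, so the function exits after processing the first list. Below is a sketch with the return moved after the loop, assuming the id column is named id as in the sample data. It keeps the original replace() approach and does not address the speed problem.

```r
createSubsets <- function(res, slib) {
    for (name in names(slib)) {
        # Keep the result where the id is in this list, NA elsewhere
        res[[name]] <- replace(res$result, !(res$id %in% slib[[name]]), NA)
    }
    res  # returned once, after every list has been processed
}

my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3,
                    stringsAsFactors = FALSE)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = "ID2")
my.df <- createSubsets(my.df, my.idlist)
```

With the sample data this adds both a List1 and a List2 column to my.df in a single call.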