[R] Fastest way to repeatedly subset a data frame?

Fri Apr 20 20:48:26 CEST 2007

On 4/20/07, Iestyn Lewis <ilewis at pharm.emory.edu> wrote:
> Hi -
>
>  I have a data frame with a large number of observations (62,000 rows,
> but only 2 columns - a character ID and a result list).
>
> Sample:
>
>  > my.df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)
>  > my.df
>    id result
> 1 ID1      1
> 2 ID2      2
> 3 ID3      3
>
> I have a list of ID vectors.  This list will have anywhere from 100 to
> 1000 members, and each member will have anywhere from 10 to 5000 id entries.
>
> Sample:
>
>  > my.idlist[["List1"]] <- c("ID1", "ID3")
>  > my.idlist[["List2"]] <- c("ID2")
>  > my.idlist
> $List1
> [1] "ID1" "ID3"
>
> $List2
> [1] "ID2"
>
>
> I need to subset that data frame by the list of IDs in each vector, to
> end up with vectors that contain just the results for the IDs found in
> each vector in the list.  My current approach is to create new columns
> in the original data frame with the names of the list items, and any
> results that don't match replaced with NA.  Here is what I've done so far:
>
> createSubsets <- function(res, slib) {
>     for(i in 1:length(slib)) {
>         res[ ,names(slib)[i]] <- replace(res$result,
>             which(!is.element(res$sid, slib[[i]])), NA)
>         return (res)
>     }
> }
>
> I have 2 problems:
>
> 1)  My function only works for the first item in the list:
>
>  > my.df <- createSubsets(my.df, my.idlist)
>  > my.df
>    id result List1
> 1 ID1      1     1
> 2 ID2      2    NA
> 3 ID3      3     3
>
> In order to get all results, I have to copy the loop out of the function
> and paste it into R directly.
>
> 2)  It is very, very slow.  For a dataset of 62,000 rows and 253 list
> entries, it takes about 5 minutes on a Pentium D.  An implementation
> of this kind of subsetting using hashtables in C# takes a negligible
> amount of time.
>
> I am open to any suggestions about data format, methods, anything.

How about:

df <- data.frame(id=c("ID1", "ID2", "ID3"), result=1:3)

ids <- list()
ids[["List1"]] <- c("ID1", "ID3")
ids[["List2"]] <- c("ID2")

rownames(df) <- df$id
lapply(ids, function(id) df[id, ])
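
The lapply() call above returns whole data frame rows for each ID list. If
only the result vectors are wanted, as the question describes, a lookup with
match() is one possible variation. This is a minimal sketch on the same toy
data, not timed on the 62,000-row case; the names `lookup_results` and
`results` are made up for illustration, and stringsAsFactors = FALSE is
assumed so that id stays a character column.

```r
# Same toy data as above; keep id as character rather than factor.
df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3,
                 stringsAsFactors = FALSE)
ids <- list(List1 = c("ID1", "ID3"), List2 = "ID2")

# Hypothetical helper: for one vector of IDs, find their row positions
# in df with match() and pull the corresponding results.
# IDs that are not present in df come back as NA.
lookup_results <- function(id) df$result[match(id, df$id)]

# One plain result vector per ID list.
results <- lapply(ids, lookup_results)
results
```

Because match() is called once per ID list rather than once per row, the
work grows with the length of each list instead of with a column rewrite of
the whole data frame.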

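On problem 1 in the question: return(res) sits inside the for loop, so the
function exits after the first list item. A sketch of the same column-per-list
approach with the return moved after the loop follows. It assumes the ID
column is called `id`, as in the sample data frame (the original function
refers to `res$sid`, which the sample does not have).

```r
# Sketch: same idea as createSubsets, returning only after the loop ends.
# Assumes the ID column is named "id", as in the sample data.
createSubsets <- function(res, slib) {
    for (name in names(slib)) {
        # Keep result where the ID is in this list, NA elsewhere.
        res[[name]] <- ifelse(res$id %in% slib[[name]], res$result, NA)
    }
    res
}

my.df <- data.frame(id = c("ID1", "ID2", "ID3"), result = 1:3)
my.idlist <- list(List1 = c("ID1", "ID3"), List2 = "ID2")
my.df <- createSubsets(my.df, my.idlist)
my.df
```

This still adds one full-length column per list, so for many lists the
lookup-per-list form is likely the lighter option.
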
Hadley