[R] aggregate / collapse big data frame efficiently
arun
smartpink111 at yahoo.com
Tue Dec 25 21:22:58 CET 2012
Hi,
Jim's method turns out to be faster than data.table() here:
n <- 10000
nLevels <- 10
nRows <- 120
Cols <- list(rep(list(sample(nRows)), n))
df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
colnames(df)[-1] <- paste0('col', 1:n)
# convert to matrix for faster processing
df.m <- as.matrix(df[, -1]) # remove levels column
system.time({
# split the indices of rows for each level
x <- split(seq(nrow(df)), df$levels)
result <- sapply(x, function(a) colMeans(df.m[a, ]))
})
# user system elapsed
# 0.056 0.000 0.056
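An aside not from the original thread: base R's rowsum() also computes group-wise column sums of a matrix in compiled code, so dividing by the group sizes gives the same means without splitting indices by hand. A minimal sketch (smaller n so it runs quickly):

```r
# rowsum() sums the rows of a matrix within groups; dividing each group's
# sums by its row count gives the group means.
set.seed(1)
n <- 1000; nRows <- 120; nLevels <- 10
df.m <- matrix(sample(nRows, nRows * n, replace = TRUE), nrow = nRows)
colnames(df.m) <- paste0('col', 1:n)
lev <- sample(nLevels, nRows, TRUE)

sums   <- rowsum(df.m, lev)      # one row per level, column sums
counts <- as.vector(table(lev))  # group sizes, in the same (sorted) order
means  <- sums / counts          # group means, nLevels x n

# cross-check against the split()/colMeans() approach from the thread
x   <- split(seq_len(nRows), lev)
ref <- t(sapply(x, function(a) colMeans(df.m[a, , drop = FALSE])))
stopifnot(all.equal(unname(means), unname(ref)))
```

Both rowsum() and table() order the groups by sorted level, so the rows line up without reordering.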
library(data.table)
df.dt <- data.table(df)
setkey(df.dt, levels)
system.time({ result1 <- df.dt[, lapply(.SD, mean), by = levels] })
# user system elapsed
# 7.756 0.000 7.771
system.time({ result2 <- df.dt[, list(Mean = colMeans(.SD)), by = levels] })
# user system elapsed
# 2.188 0.000 2.193
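One detail worth flagging (my observation, not from the post): result1 and result2 above are not the same shape. lapply(.SD, mean) keeps one output column per input column (wide), while list(Mean = colMeans(.SD)) stacks all the per-column means into a single column (long), repeating each level. A tiny illustration:

```r
# Demonstrate the shape difference between the two data.table idioms.
library(data.table)
dt <- data.table(levels = c(1, 1, 2, 2), a = 1:4, b = 5:8)

wide <- dt[, lapply(.SD, mean), by = levels]          # one row per level
long <- dt[, list(Mean = colMeans(.SD)), by = levels] # one row per level x column

stopifnot(nrow(wide) == 2, ncol(wide) == 3)  # levels, a, b
stopifnot(nrow(long) == 4, ncol(long) == 2)  # levels, Mean (stacked)
```

So the two timings are computing the same means, but delivering them in different layouts.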
A.K.
----- Original Message -----
From: jim holtman <jholtman at gmail.com>
To: Martin Batholdy <batholdy at googlemail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, December 25, 2012 1:20 PM
Subject: Re: [R] aggregate / collapse big data frame efficiently
According to the way that you have used 'aggregate', you are taking
the column means. A couple of suggestions for faster processing:
1. use matrices instead of data.frames (I converted your example
before using it)
2. use 'colMeans'
I created a 120 x 100000 matrix with 10 levels and it does the
computation in less than 2 seconds:
> n <- 100000
> nLevels <- 10
> nRows <- 120
> Cols <- list(rep(list(sample(nRows)), n))
> df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
> colnames(df)[-1] <- paste0('col', 1:n)
>
> # convert to matrix for faster processing
> df.m <- as.matrix(df[, -1]) # remove levels column
> str(df.m)
int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
> system.time({
+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
user system elapsed
1.33 0.00 1.35
> str(result)
num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
..$ : chr [1:10] "1" "2" "3" "4" ...
>
On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote:
> Hi,
>
>
> I need to aggregate rows of a data.frame by computing the mean for rows with the same level of one factor variable;
>
> here is the sample code:
>
>
> x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
> aggregate(x, list(x[,1]), mean)
>
>
> Now my problem is that the actual data set is much bigger (120 rows and approximately 100,000 columns), and it takes very, very long (at some point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more efficient?
> Or is there a different approach that would work faster?
>
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.