[R] Group by a data frame with multiple columns
    arun 
    smartpink111 at yahoo.com
       
    Sun Aug  4 06:07:32 CEST 2013
    
    
  
Hi,
May be you should try ?data.table().  
Please use ?dput(). 
dat1<- read.table(text="
Area Sex Year y
Bob F 2011 1
Bob F 2011 2
Bob F 2012 3
Bob M 2012 3
Bob M 2012 2
Fred F 2011 1
Fred F 2011 1
Fred F 2012 2
Fred M 2012 3
Fred M 2012 1
",sep="",header=TRUE,stringsAsFactors=FALSE)
library(data.table)
 dt2<-dt1[,sum(y),by=list(Area,Sex,Year)]
 dt2
#   Area Sex Year V1
#1:  Bob   F 2011  3
#2:  Bob   F 2012  3
#3:  Bob   M 2012  5
#4: Fred   F 2011  2
#5: Fred   F 2012  2
#6: Fred   M 2012  4
#Speed
set.seed(28)
dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),Sex=sample(c("F","M"),1e7,replace=TRUE),Year=sample(2005:2012,1e7,replace=TRUE),y=sample(1:10,1e7,replace=TRUE))
system.time(datTest<- aggregate(y~.,data=dat2,sum))
#   user  system elapsed 
# 18.056   1.336  19.424 
datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),]
row.names(datTest2)<- 1:nrow(datTest2)
dtTest<- data.table(dat2)
 system.time({
 setkey(dtTest,Area,Sex,Year)
dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]})
# user  system elapsed 
#  1.232   0.184   1.418 
 setnames(dtTest2,"V1","y")
identical(datTest2,as.data.frame(dtTest2))
#[1] TRUE
A.K.
----- Original Message -----
From: Michael Liaw <michael.liaw at hotmail.com>
To: r-help at r-project.org
Cc: 
Sent: Saturday, August 3, 2013 8:11 PM
Subject: [R] Group by a data frame with multiple columns
Hi
I'm trying to manipulate a data frame (that has about 10 million rows) rows
by "grouping" it with multiple columns. For example, say the data set looks
like:
Area
Sex
Year
y
Bob
F
2011
1
Bob
F
2011
2
Bob
F
2012
3
Bob
M
2012
3
Bob
M
2012
2
Fred
F
2011
1
Fred
F
2011
1
Fred
F
2012
2
Fred
M
2012
3
Fred
M
2012
1
And I want it to look like
Area
Sex
Year
Sum of y
Bob
F
2011
3
Bob
F
2012
3
Bob
M
2012
5
Fred
F
2011
2
Fred
F
2012
2
Fred
M
2012
4
I think I can use something like:
tmp <- aggregate (y ~ ., sum)
But due to the size it's really taking a strain on the computer (even with
64-bit R on a, yes unfortunately Windows, machine with 16GB RAM :().  The
reason for me wanting the data set to get into this form is I want to then
apply the population information and get the "rate" on the "sum of y" column
then fit a Poisson regression model.
I'm wondering (and would appreciate comments) whether there is a more
efficient way to the process I described? 
Cheers
Michael
    [[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
    
    
More information about the R-help
mailing list