[R] Code is too slow: mean-centering variables in a data frame by subgroup
Charles C. Berry
cberry at tajo.ucsd.edu
Tue Mar 30 18:24:20 CEST 2010
On Tue, 30 Mar 2010, Dimitri Liakhovitski wrote:
> Dear R-ers,
>
> I have a large data frame (several thousands of rows and about 2.5
> thousand columns). One variable ("group") is a grouping variable with
> over 30 levels. And I have a lot of NAs.
> For each variable, I need to divide each value by variable mean - by
> subgroup. I have the code but it's way too slow - takes me about 1.5
> hours.
> Below is a data example and my code that is too slow. Is there a
> different, faster way of doing the same thing?
> Thanks a lot for your advice!
>
> Dimitri
>
>
> # Building an example frame - with groups and a lot of NAs:
> set.seed(1234)
> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
Use model.matrix and crossprod to do this in a vectorized fashion:
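> mat <- as.matrix(frame[,-1])
> mm <- model.matrix(~0+group,frame)           # one indicator column per group level
> col.grp.N <- crossprod( !is.na(mat), mm )    # non-NA counts, column by group
> mat[is.na(mat)] <- 0.0                       # so NAs add nothing to the sums
> col.grp.sum <- crossprod( mat, mm )          # sums, column by group
> # index by level position; as.factor() also covers a character 'group'
> mat <- mat / ( t(col.grp.sum/col.grp.N)[ as.integer(as.factor(frame$group)),] )
> is.na(mat) <- is.na(frame[,-1])              # put the NAs back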
>
mat is now a matrix whose columns each correspond to the columns in
'frame' as you have it after do.call(...)
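For reuse, the same model.matrix/crossprod idea can be wrapped in a function. This is only a sketch: the name scale_by_group_mean is made up for illustration, and it assumes 'mat' is a numeric matrix and 'group' has no NAs.

```r
# Hypothetical helper (name invented for this sketch): divide every column
# of a numeric matrix by its per-group mean, ignoring NAs.
scale_by_group_mean <- function(mat, group) {
  group <- factor(group)                 # works for character or factor input
  mm <- model.matrix(~ 0 + group)        # rows x levels indicator matrix
  n <- crossprod(!is.na(mat), mm)        # columns x levels: non-NA counts
  filled <- mat
  filled[is.na(filled)] <- 0             # NAs contribute nothing to the sums
  s <- crossprod(filled, mm)             # columns x levels: sums
  means <- t(s / n)                      # levels x columns: group means
  # Expand the means to one row per observation; NAs in 'mat' stay NA.
  mat / means[as.integer(group), , drop = FALSE]
}
```

With the example data it would be called as scale_by_group_mean(as.matrix(frame[,-1]), frame$group).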
Are you sure you want to divide the values by their group means, which
may be negative or close to zero?
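If centering is what is wanted, as the subject line suggests, the group means would be subtracted rather than divided out. A minimal sketch using rowsum() for the group sums follows; the function name center_by_group_mean is made up for illustration, and it assumes a numeric matrix and a grouping vector without NAs.

```r
# Hypothetical helper (name invented for this sketch): subtract the
# per-group mean from every column of a numeric matrix, ignoring NAs.
center_by_group_mean <- function(mat, group) {
  key <- as.character(group)                 # rowsum() labels rows by group value
  sums <- rowsum(mat, key, na.rm = TRUE)     # groups x columns: sums without NAs
  n <- rowsum((!is.na(mat)) * 1, key)        # groups x columns: non-NA counts
  means <- sums / n                          # groups x columns: group means
  # Look up each observation's group row by name; NAs in 'mat' stay NA.
  mat - means[key, , drop = FALSE]
}
```

With the example data it would be called as center_by_group_mean(as.matrix(frame[,-1]), frame$group).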
HTH,
Chuck
> frame<-frame[order(frame$group),]
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
> i.for.NA<-sample(1:100,60)
> frame[[i]][i.for.NA]<-NA
> }
> frame
>
> ### Code that does what's needed but is too slow:
> Start<-Sys.time()
> frame <- do.call(cbind, lapply(names.used, function(x){
> unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
> }))
> Finish<-Sys.time()
> print(Finish-Start) # Takes too long
>
> --
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901