[R] Big data and column correspondence problem

Daniel Malter daniel at umd.edu
Wed Jul 27 08:45:15 CEST 2011


If A has more columns than in your example, you could merge only those
columns of A that are needed for the match with B. You could then cbind the
merged columns back onto the rest of A, as long as the merged data has one
row per row of A and is in the same order as A. Note that merge() sorts the
result by the key columns by default, so the original order has to be
restored first.
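
A minimal sketch of that idea, on made-up data. The column names (key, a1, a2, b) and the row_id helper are only for illustration, and the sketch assumes that B has at most one row per key; otherwise the merged data has more rows than A and the cbind no longer lines up.

```r
# Toy data; all names here are illustrative, not from the original problem.
set.seed(1)
A <- data.frame(key = sample(1:5, 8, replace = TRUE),
                a1 = rnorm(8), a2 = rnorm(8))
B <- data.frame(key = 1:5, b = letters[1:5])   # assumption: one row per key

# Merge only the key column of A, plus a row index to restore the order.
slim <- data.frame(key = A$key, row_id = seq_len(nrow(A)))
m <- merge(slim, B, by = "key", all.x = TRUE, all.y = FALSE)

# merge() sorts by key, so put the rows back into A's original order.
m <- m[order(m$row_id), ]
rownames(m) <- NULL

# Bind the newly merged columns back onto the full A.
new_cols <- setdiff(names(m), c("key", "row_id"))
result <- cbind(A, m[, new_cols, drop = FALSE])
```

The merge itself then only ever sees two narrow columns of A, which is where the memory saving would come from.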

Alternatively, you can always use chunks of A and do the merging separately,
e.g., for blocks of 10000 observations or so.

# timing for one block of 15000 observations
x<-sample(1:150,15000,replace=TRUE)
y<-sample(1:150,15000,replace=TRUE)
a<-rnorm(15000)
b<-rnorm(15000)
A<-cbind(x,a)
B<-cbind(y,b)
system.time(newdata<-merge(A,B,by.x='x',by.y='y',all.x=TRUE,all.y=FALSE))

On a MacBook Pro with 4 GB of RAM and a 2.4 GHz dual-core processor, it would
take you about 40 minutes if you merge in chunks of 15000 observations. I am
not sure whether the loop would be slower than that.
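
The chunked merge could look roughly like the following sketch. It uses toy data of the same shape as the timing example above; chunk_size and the object names are placeholders, and on real data the block size would be something like 10000.

```r
# Toy data shaped like the timing example above (sizes are illustrative).
set.seed(1)
A <- data.frame(x = sample(1:150, 1000, replace = TRUE), a = rnorm(1000))
B <- data.frame(y = sample(1:150, 300, replace = TRUE), b = rnorm(300))

chunk_size <- 250   # placeholder; pick what fits in memory
starts <- seq(1, nrow(A), by = chunk_size)

# Merge each block of rows of A with B separately.
pieces <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(A))
  merge(A[rows, , drop = FALSE], B, by.x = "x", by.y = "y",
        all.x = TRUE, all.y = FALSE)
})

# Stack the per-chunk results.
chunked <- do.call(rbind, pieces)
```

If even the stacked result is too large, each piece could be reduced to the columns or summaries you need (or written to disk) inside the function, so that the full merge never has to sit in memory at once.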

On a different note: how are you matching if AA has multiple matches in BB?
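
To illustrate why this matters, here is a small sketch using the key values from the quoted message below. The id and j columns are added only to track rows and are not part of the original data.

```r
AA <- c(4, 4, 4, 2, 2, 6, 8, 9)
BB <- c(2, 2, 4, 6, 6)

# id and j are illustrative row trackers.
A <- data.frame(AA = AA, id = seq_along(AA))
B <- data.frame(BB = BB, j = seq_along(BB))

m <- merge(A, B, by.x = "AA", by.y = "BB", all.x = TRUE, all.y = FALSE)

nrow(m)       # 11 rows, although A has only 8
table(m$id)   # rows of A with key 2 or 6 appear twice
```

Every row of A is repeated once per matching row of B, so on large data with many duplicated keys the merged object can be far bigger than A and B together.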

Daniel


murilofm wrote:
> 
> Thanks Daniel, that helped me. Based on your suggestions I built this
> final code:
> 
> library(foreign)
> library(gdata)
> 
> AA = c(4,4,4,2,2,6,8,9) 
> A1 = c(3,3,11,5,5,7,11,12) 
> A2 = c(3,3,7,3,5,7,11,12) 
> A = cbind(AA, A1, A2) 
> 
> BB = c(2,2,4,6,6) 
> B1 =c(5,11,7,13,NA) 
> B2 =c(4,12,11,NA,NA) 
> B3 =c(12,13,NA,NA,NA) 
> 
> A = cbind(AA, A1, A2,0) 
> B=cbind(BB,B1,B2,B3) 
> 
> newdata<-merge(A,B,by.x='AA',by.y='BB',all.x=T,all.y=F)
> newdata$dum <- rowSums(newdata[, matchcols(newdata, with=c("B"))] == newdata$A1,
>                        na.rm = FALSE, dims = 1) *
>                rowSums(newdata[, matchcols(newdata, with=c("B"))] == newdata$A2,
>                        na.rm = FALSE, dims = 1)
> 
> colnames(A)[4]<-"dum"
> newdata$dum1<-newdata$dum
> A_final<-merge(A,newdata,by.x=c("AA","A1","A2","dum"),by.y=c("AA","A1","A2","dum"),all.x=T,all.y=F)
> 
> Which gives me the same result as the "loop" version. Unfortunately, I
> can't replicate it on the original data since I can't make the merge work:
> I get an error message "Reached total allocation of 4090Mb". So, I'm stuck
> again.
> 
> If anyone could shed some light on this problem, I would really
> appreciate it.
> 
