[R] How to make this for() loop memory efficient?
Ray Brownrigg
Ray.Brownrigg at ecs.vuw.ac.nz
Wed Jan 11 01:21:39 CET 2012
On Wed, 11 Jan 2012, Ray Brownrigg wrote:
> On Wed, 11 Jan 2012, iliketurtles wrote:
> > ##I have 2 columns of data. The first column is unique "event IDs" that
> > represent a phone call made to a customer.
> > ###So, if you see 3 entries together in the first column like follows:
> >
> > matrix(c("call1a","call1a","call1a") )
> >
> > ##then this means that this particular phone call (the first call that's
> > logged in the data set) was transferred
> > ##between 3 different "modules" before the call was terminated.
> >
> > ##The second column is a numerical description of the module the call
> > ##started with and then got transferred to prior to call termination.
> > ##Now, I'll construct a representative array of the type of data I'm
> > ##dealing with (the real data set goes on for X00,000s of rows):
> > ##(Ignore how I construct the following array, it’s completely unrelated
> > ##to how the actual data set was constructed).
> >
> >
> > a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
> > development.a<-seq(1,40,3)
> > development.a2<-seq(1,40,5)
> > a[development.a]<-a[development.a+1]
> > a[development.a2]<-a[development.a2+1]
> > a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a";a[9]<-"call9a"
> > b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
> >      930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
> >      930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
> >      960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
> >      970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
> > data<-as.data.frame(cbind(a,b))
> > colnames(data)<-c("phone calls","modules")
> > dim(data)
> > print(data[1:10,]) #sample of 10 rows
> >
> > # Note that in the real data set, data[,2] ranges from 810,000 to
> > 999,999. I've been tasked with the following:
> > # "For each phone call that BEGINS with the module which is denoted by 81
> > (i.e. of the form 81X,XXX), what is the expected number of modules in
> > these calls?"
> > #Then it's the same question for each module beginning with 82, 83,
> > 84..... all the way until 99.
> > #I've created code that I think works for this, but I can't actually run
> > it on the whole data set. I left it for 30 minutes and it only had about
> > #5% of the task completed (I clicked "STOP" then checked my output to
> > see if I did it properly, and it seems correct).
> > #I know the apply() family specializes in vector operations, but I can't
> > figure out how to complete the above question in any way other than
> > #loops.
> >
> > L<-data
> >
> > A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
> > A<-data.frame(A)
> >
> > for(i in 1:(nrow(L)-1))
> > {
> >
> > if(L[(i+1),1]!=L[i,1])
> > {
> >
> > A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]<- {
> >
> > #aggregate number of modules in the calls that begin with XX (not yet averaged).
> > A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+
> > length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
> >
> > }
> >
> > A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]<- {
> >
> > A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1 }
> >
> > }
> >
> > }
> >
> > #If I can get this code to be more memory efficient such that I can do it
> > on a 400,000 row data set, I can do, for example,
> >
> > A[17,1]/A[17,2]
> >
> > #and I'll arrive at the mean number of modules per call where the call
> > starts with a module that starts with 97.
> >
> > A[17,1]
> > #is 10, which means that, out of every single call that started with a
> > module of 97X,XXX,
> > #they went through 10 modules in total.
> >
> > A[17,2]
> > #is 6, which means that there were 6 calls in total that began with a
> > #97X,XXX module.
> >
> > #Hence,
> >
> >
> > A[17,1]/A[17,2]
> >
> > #is the average number of modules that were executed in all the calls
> > that began with a 97X,XXX module.
> >
> >
> > -----
> > ----
> >
> > Isaac
> > Research Assistant
> > Quantitative Finance Faculty, UTS
>
> I don't see any need for you to use data frames.
>
> If you make A and data (not a good use of a variable name) just matrices,
> you get the same answers at about 10 times the speed (using your example).
>
Further, you should calculate your rowname, namely:
paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse="")
only once per iteration, instead of 4 times. This saves another 25-30% of CPU time.
And you can combine the two updates into a single assignment.
So using the code:
L <- as.matrix(data)
A <- array(0, dim=c(19, 2)); rownames(A) <- seq(81, 99, 1)
# A <- data.frame(A)
for(i in 1:(nrow(L)-1))
{
  if(L[(i+1),1]!=L[i,1])
  {
    myrow <- paste(strsplit(as.character(L[i+1, 2]), "")[[1]][1:2], sep="",
                   collapse="")
    A[myrow, ] <- A[myrow, ] +
      c(length(grep(as.character(L[i+1, 1]), L[, 1], value=FALSE)), 1)
  }
}
is 15 times as fast as your original code.
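
If the loop is still too slow on the full data set, it may be possible to drop it altogether. What follows is only a sketch, not something timed in this thread. It assumes that all rows belonging to one call are adjacent, that no two neighbouring calls share an ID, and that every module code has six digits (so integer division by 10000 gives the two-digit prefix). The function name mean_modules_by_prefix and the small ids/modules vectors are made up for illustration. Note that it will not reproduce the loop's numbers exactly: the loop never counts the first call in the data, and grep() does a regular-expression match over the whole column, so an ID such as "call1" would also match "call10".

```r
# Hypothetical helper (not from the original post): mean number of modules
# per call, grouped by the two-digit prefix of the module each call starts with.
mean_modules_by_prefix <- function(ids, modules) {
  # Each run of identical consecutive IDs is treated as one call.
  runs <- rle(as.character(ids))
  # Row index at which each call starts.
  first_row <- cumsum(runs$lengths) - runs$lengths + 1
  # Two-digit prefix of the starting module (assumes six-digit codes).
  prefix <- as.numeric(as.character(modules[first_row])) %/% 10000
  # Average the call lengths within each prefix.
  tapply(runs$lengths, prefix, mean)
}

# Made-up example: two calls start with a 97 module, one with an 82 module.
ids     <- c("c1", "c1", "c1", "c2", "c2", "c3")
modules <- c(970050, 930010, 920010, 970010, 960500, 820009)
res <- mean_modules_by_prefix(ids, modules)
print(res)
```

With the data from the original post this would be called as mean_modules_by_prefix(data[, 1], data[, 2]). Since rle(), cumsum() and tapply() each work on whole vectors, the per-row grep() over the full column goes away, which is where most of the loop's time is likely spent.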
> Hope this helps,
> Ray Brownrigg
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented,
> minimal, self-contained, reproducible code.