[R] Competing with SPSS and SAS: improving code that loops throughrows (data manipulation)

Martin Morgan mtmorgan at fhcrc.org
Sat Mar 27 16:43:32 CET 2010


On 03/26/2010 06:40 PM, Dimitri Liakhovitski wrote:
> My sincere apologies if it looked large. Let me try again with less code.
> It's hard to do less than that. In fact - there is nothing in this
> code but 1 formula and many loops, which is the problem I am not sure
> how to solve.
> I also tried to be as clear as possible with the comments.
> Dimitri
> 
> ## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE
>   set.seed(123)
>   data<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0,
> sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2)))
>   data                   # "data" it is the data frame to work with
> ## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data"
> would contain up to 150-200 rows PER SUBGROUP
> 
> ### Specifying useful parameters used in the slow code below:
> vars<-names(data)[2:3]                    # names of variables used in
> transformation; in real life - up to 50-60 variables
> group.var<-names(data)[1]                # name of the grouping variable
> subgroups<-levels(data[[group.var]])   # names of subgroups; in real
> life - up to 30 subgroups
> 
> # OBJECTIVE:
> # Need to create new variables based on the old ones (a & b)
> # For each new variable, the value in a given row is a function of (a)
> 2 constants (that have several levels each),
> # (b) value of the original variable (e.g., a.ind.to.max"), and the
> value in the previous row on the same new variable
> # Plus - it has to be done by subgroup (variable "group")
> 
> # Defining 2 constants:
> constant1<-c(1:3)                # constant 1 used in transformation -
> has 3 levels, in real life - up to 7 levels
> constant2<-seq(.15,.45,.15)  # constant 2 used in transformation - has
> 3 levels, in real life - up to 7 levels
> 
> ### CODE THAT IS SLOW. Reason - too many loops with the inner-most
> loop being very slow - as it is looping through rows:
> 
> for(var in vars){                               # looping through variables
>   for(c1 in 1:length(constant1)){        # looping through values of constant1
>        for(c2 in 1:length(constant2)){   # looping through values of constant2
>          d=log(0.5)/constant1[c1]
>           l=-log(1-constant2[c2])
>           name<-paste(var,constant1[c1],constant2[c2]*100,".transf",sep=".")
>           data[[name]]<-NA
>           for(subgroup in subgroups){     # looping through subgroups
>             data[data[[group.var]] %in% subgroup, name][1] =
> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
> var][1]*l*10)))
> 
>       ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS:
>              for(case in 2:nrow(data[data[[group.var]] %in% subgroup,
> ])){ # looping through rows
>                data[data[[group.var]] %in% subgroup, name][case]=
> 1-((1-data[data[[group.var]] %in% subgroup,
> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
> var][case]*l*10)))
>              }
>       ### END OF THE SLOWEST SECTION (INNERMOST LOOP)

This is a good place to start! Look at the inner loop

I notice there are a lot of expressions whose values don't change, but
are calculated repeatedly, e.g., exp(1) and data[[group.var]] %in%
subgroup. Let's move those common subexpressions out of the loop, e.g.,

exp1 <- exp(1)
idx <- data[[group.var]] %in% subgroup

for(case in 2:nrow(data[idx,])) {
    data[idx, name][case]=
        1-((1-data[idx, name][case-1]*exp1^d) /
           (exp1^(data[idx, var][case]*l*10)))
}

wow that's already looking better! Looks like data[idx,var] could also
come out of the loop, and data[idx, name], too, so long as we assign
back in at the end

exp1 <- exp(1)
idx <- data[[group.var]] %in% subgroup
a <- data[idx, var]
x <- data[idx, name]

for(case in 2:length(x)) {
    x[case] <- 1 - (1 - x[case-1]*exp1^d) /
        (exp1^(a[case]*l*10))
}

data[idx, name] <- x

Still some redundant calculations, e.g., exp1^d, l*10. Some of the
things we've moved out of the innermost loop can be moved even further
out (e.g., exp1 all the way to the top). And a calculation like

  exp1^(a[case] * l * 10)

would be better written out of the loop as

  b <- exp1^(a * l * 10)

Why not try re-implementing your code with these ideas in mind? It'll be
important to make sure that the results of your original and new
implementation are the same (using all.equal, for instance). The 'CODE
THAT IS SLOW' part of your original code took 0.35s to execute on my
machine (using system.time), whereas my revised code with changes like
that above, and a couple of other small things, ran in 0.01s.

What's your revised code look like now?

Martin

>           }
>       }
>    }
> }
> 
> ### END OF THE CODE
> 
> On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>> Dmitri:
>>
>> If you follow the R posting guide you're more likely to get useful replies.
>> In particular it asks for **small** reproducible examples -- your example is
>> far more code then I care to spend time on anyway (others may be more
>> willing or more able to do so of course). I suggest you try (if you haven't
>> already):
>>
>> 1. Profiling the code using Rprof to isolate where the time is spent.And
>> then...
>>
>> 2. Writing a **small** reproducible example to exercise that portion of the
>> code and post it with your question to the list. If you need to...
>> Typically, if you do these things you'll  figure out how to fix the
>> situation on your own.
>>
>> Cheers,
>>
>> Bert Gunter
>> Genentech Nonclinical Statistics
>>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>> Behalf Of Dimitri Liakhovitski
>> Sent: Friday, March 26, 2010 2:06 PM
>> To: r-help
>> Subject: [R] Competing with SPSS and SAS: improving code that loops
>> throughrows (data manipulation)
>>
>> Dear R-ers,
>>
>> In my question there are no statistics involved - it's all about data
>> manipulation in R.
>> I am trying to write a code that should replace what's currently being
>> done in SAS and SPSS. Or, at least, I am trying to show to my
>> colleagues R is not much worse than SAS/SPSS for the task at hand.
>> I've written a code that works but it's too slow. Probably because
>> it's looping through a lot of things. But I am not seeing how to
>> improve it. I've already written a different code but it's 5 times
>> slower than this one. The code below takes me slightly above 5 sec for
>> the tiny data set. I've tried using it with a real one - was not done
>> after hours.
>> Need help of the list! Maybe someone will have an idea on how to
>> increase the efficiency of my code (just one block of it - in the
>> "DATA TRANSFORMATION" Section below)?
>>
>> Below - I am creating the data set whose structure is similar to the
>> data sets the code should be applied to. Also - I have desribed what's
>> actually being done - in comments.
>> Thanks a lot to anyone for any suggestion!
>>
>> Dimitri
>>
>> ###### CREATING THE TEST DATA SET ################################
>>
>> set.seed(123)
>> data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10)
>> ,a=abs(round(rnorm(20)*10,0)),
>> b=abs(round(rnorm(20)*100,0)))
>> data
>> dim(data)[1]  # !!! In real life I might have up to 150 (!) rows
>> (weeks) within each subgroup
>>
>> ### Specifying parameters used in the code below:
>> vars<-names(data)[3:4] # names of variables to be transformed
>> nr.vars<-length(vars) # number of variables to be transformed;  !!!
>> in real life I'll have to deal with up to 50-60 variables, not 2.
>> group.var<-names(data)[1] # name of the grouping variable
>> subgroups<-levels(data[[group.var]]) # names of subgroups;  !!! in
>> real life I'll have up to 20-25 subgroups, not 2.
>>
>> # For EACH subgroup: indexing variables a and b to their maximum in
>> that subgroup;
>> # Further, I'll have to use these indexed variables to build the new ones:
>> for(i in vars){
>>        new.name<-paste(i,".ind.to.max",sep="")
>>        data[[new.name]]<-NA
>> }
>>
>> indexed.vars<-names(data)[grep("ind.to.max$", names(data))] #
>> variables indexed to subgroup max
>> for(subgroup in subgroups){
>>        data[data[[group.var]] %in%
>> subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in%
>> subgroup,vars],function(x){
>>                y<-x/max(x)
>>                return(y)
>>        })
>> }
>> data
>>
>> ############# DATA TRANSFORMATION #########################################
>>
>> # Objective: Create new variables based on the old ones (a and b ind.to.max)
>> # For each new variable, the value in a given row is a function of (a)
>> 2 constants (that have several levels each),
>> # (b) the corresponding value of the original variable (e.g.,
>> a.ind.to.max"), and the value in the previous row on the same new
>> variable
>> # PLUS: - it has to be done by subgroup (variable "group")
>>
>> constant1<-c(1:3)            # constant 1 used for transformation -
>> has 3 levels;  !!! in real life it will have up to 7 levels
>> constant2<-seq(.15,.45,.15)  # constant 2 used for transformation -
>> has 3 levels;  !!! in real life it will have up to 7 levels
>>
>> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous
>> code section):
>> start1<-Sys.time()
>> for(var in indexed.vars){     # looping through variables
>>  for(c1 in 1:length(constant1)){     # looping through levels of constant1
>>          for(c2 in 1:length(constant2)){    # looping through levels of
>> constant2
>>      d=log(0.5)/constant1[c1]
>>      l=-log(1-constant2[c2])
>>
>> name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..t
>> ransf",sep=".")
>>      data[[name]]<-NA
>>      for(subgroup in subgroups){     # looping through subgroups
>>        data[data[[group.var]] %in% subgroup, name][1] =
>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>> var][1]*l*10)))  # this is just the very first row of each subgroup
>>        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
>>  # looping through the remaining rows of the subgroup
>>          data[data[[group.var]] %in% subgroup, name][case]=
>> 1-((1-data[data[[group.var]] %in% subgroup,
>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>> var][case]*l*10)))
>>                        }
>>                }
>>        }
>>  }
>> }
>> end1<-Sys.time()
>> print(end1-start1) # Takes me ~0.53 secs
>> names(data)
>> data
>>
>> --
>> Dimitri Liakhovitski
>> Ninah.com
>> Dimitri.Liakhovitski at ninah.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
> 
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the R-help mailing list