[R] Competing with SPSS and SAS: improving code that loops through rows (data manipulation)

Fri Mar 26 22:05:51 CET 2010

Dear R-ers,

In my question there are no statistics involved - it's all about data
manipulation in R.
I am trying to write code that replaces what is currently being
done in SAS and SPSS. Or, at least, I am trying to show my
colleagues that R is not much worse than SAS/SPSS for the task at hand.
I've written code that works, but it's too slow, probably because
it loops through a lot of things, and I don't see how to
improve it. I've already written a different version, but it's 5 times
slower than this one. The code below takes slightly above 0.5 sec for
the tiny data set. I tried it on a real one and it was not done
after hours.
I need the list's help! Maybe someone has an idea on how to
increase the efficiency of my code (just one block of it, in the
"DATA TRANSFORMATION" section below)?

Below I create a data set whose structure is similar to the
data sets the code should be applied to. I have also described what's
actually being done in the comments.
Thanks a lot to anyone for any suggestion!

Dimitri

###### CREATING THE TEST DATA SET ################################

set.seed(123)
data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10),a=abs(round(rnorm(20)*10,0)),
b=abs(round(rnorm(20)*100,0)))
data
dim(data)[1]  # !!! In real life I might have up to 150 (!) rows (weeks) within each subgroup

### Specifying parameters used in the code below:
vars<-names(data)[3:4] # names of variables to be transformed
nr.vars<-length(vars) # number of variables to be transformed;  !!! in real life I'll have to deal with up to 50-60 variables, not 2.
group.var<-names(data)[1] # name of the grouping variable
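# names of subgroups; unique() also works when the grouping column is
# character rather than factor (levels() would return NULL in that case)
subgroups<-unique(as.character(data[[group.var]])) # !!! in real life I'll have up to 20-25 subgroups, not 2.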

# For EACH subgroup: indexing variables a and b to their maximum in that subgroup;
# Further, I'll have to use these indexed variables to build the new ones:
for(i in vars){
	new.name<-paste(i,".ind.to.max",sep="")
	data[[new.name]]<-NA
}

indexed.vars<-names(data)[grep("ind.to.max$", names(data))] # variables indexed to subgroup max
for(subgroup in subgroups){
	rows<-data[[group.var]] %in% subgroup # rows belonging to this subgroup
	data[rows,indexed.vars]<-lapply(data[rows,vars],function(x){
		y<-x/max(x)
		return(y)
	})
}
data
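
For comparison, the per-subgroup indexing above can probably be written without the explicit loop over subgroups. The following is only a sketch: it rebuilds the test data under the hypothetical name `dat` so that it runs on its own, and it relies on base R's `ave()`; the `.ind.to.max` naming follows the code above.

```r
# Sketch: index each variable to its subgroup maximum with ave().
# `dat` is a hypothetical copy of the test data, rebuilt here so the block is self-contained.
set.seed(123)
dat <- data.frame(group = c(rep("first", 10), rep("second", 10)),
                  week  = c(1:10, 1:10),
                  a     = abs(round(rnorm(20) * 10, 0)),
                  b     = abs(round(rnorm(20) * 100, 0)))
vars <- c("a", "b")

for (v in vars) {
  # ave() applies FUN within each level of the grouping variable
  # and returns a vector aligned with the original rows
  dat[[paste(v, ".ind.to.max", sep = "")]] <-
    ave(dat[[v]], dat$group, FUN = function(x) x / max(x))
}
```

This assumes every subgroup has a non-zero maximum; a subgroup of all zeros would give NaN, as it would in the loop version.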

############# DATA TRANSFORMATION #########################################

# Objective: Create new variables based on the old ones (a and b ind.to.max)
# For each new variable, the value in a given row is a function of
# (a) 2 constants (that have several levels each),
# (b) the corresponding value of the original variable (e.g., "a.ind.to.max"),
# and (c) the value in the previous row of the same new variable
# PLUS: - it has to be done by subgroup (variable "group")

constant1<-c(1:3)            # constant 1 used for transformation - has 3 levels;  !!! in real life it will have up to 7 levels
constant2<-seq(.15,.45,.15)  # constant 2 used for transformation - has 3 levels;  !!! in real life it will have up to 7 levels

# CODE THAT IS TOO SLOW (it uses parameters specified in the previous code section):
start1<-Sys.time()
for(var in indexed.vars){     # looping through variables
  for(c1 in seq_along(constant1)){     # looping through levels of constant1
    for(c2 in seq_along(constant2)){   # looping through levels of constant2
      d <- log(0.5)/constant1[c1]
      l <- -log(1-constant2[c2])
      name <- paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..transf",sep=".")
      data[[name]] <- NA
      for(subgroup in subgroups){     # looping through subgroups
        # this is just the very first row of each subgroup
        data[data[[group.var]] %in% subgroup, name][1] <-
          1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, var][1]*l*10)))
        # looping through the remaining rows of the subgroup
        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
          data[data[[group.var]] %in% subgroup, name][case] <-
            1-((1-data[data[[group.var]] %in% subgroup, name][case-1]*exp(1)^d)/
               (exp(1)^(data[data[[group.var]] %in% subgroup, var][case]*l*10)))
        }
      }
    }
  }
}
end1<-Sys.time()
print(end1-start1) # Takes me ~0.53 secs
names(data)
data
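
One direction that might help, offered only as a sketch: the innermost loop above subsets the data frame several times for every single row, so pulling each subgroup's values out into a plain numeric vector and running the same recurrence there should avoid most of that work. The helper name `transf()` and the small data frame `dat` below are hypothetical and exist only for illustration; whether this is fast enough on the real data is not established here.

```r
# Sketch: run the recurrence on a plain numeric vector, one subgroup at a time.
# transf() is a hypothetical helper name, not part of the original code.
transf <- function(x, c1, c2) {
  d <- log(0.5) / c1
  l <- -log(1 - c2)
  y <- numeric(length(x))
  prev <- 0                      # the first row uses a previous value of 0
  for (i in seq_along(x)) {
    y[i] <- 1 - (1 - prev * exp(d)) / exp(x[i] * l * 10)
    prev <- y[i]
  }
  y
}

# Small hypothetical example data (values already indexed to the subgroup maximum)
dat <- data.frame(group = rep(c("first", "second"), each = 4),
                  a.ind.to.max = c(0.2, 1, 0.5, 0.1, 1, 0.3, 0.6, 0.9))

# ave() applies the helper within each subgroup and keeps the row order,
# so the recurrence restarts at the first row of every subgroup
dat$a.1.15.transf <- ave(dat$a.ind.to.max, dat$group,
                         FUN = function(x) transf(x, c1 = 1, c2 = 0.15))
```

The loops over variables and over the levels of the two constants would still be needed around the `ave()` call; only the per-row data frame indexing is removed.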

-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com