[R] Competing with SPSS and SAS: improving code that loops through rows (data manipulation)
Dimitri Liakhovitski
ld7631 at gmail.com
Fri Mar 26 22:05:51 CET 2010
Dear R-ers,
In my question there are no statistics involved - it's all about data
manipulation in R.
I am trying to write code to replace what is currently being done in
SAS and SPSS. Or, at least, I am trying to show my colleagues that R
is not much worse than SAS/SPSS for the task at hand.
I've written code that works, but it's too slow, probably because it
loops through a lot of things, and I don't see how to improve it.
I've already written a different version, but it's 5 times slower
than this one. The code below takes me slightly above 5 sec for the
tiny data set. I tried it on a real one, and it was not done after
hours.
I need the list's help! Maybe someone will have an idea of how to
increase the efficiency of my code (just one block of it, in the
"DATA TRANSFORMATION" section below)?
Below, I create a data set whose structure is similar to the data
sets the code should be applied to. I have also described what's
actually being done, in comments.
Thanks a lot to anyone for any suggestion!
Dimitri
###### CREATING THE TEST DATA SET ################################
set.seed(123)
data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10),a=abs(round(rnorm(20)*10,0)),
b=abs(round(rnorm(20)*100,0)))
data
dim(data)[1] # !!! In real life I might have up to 150 (!) rows (weeks) within each subgroup
### Specifying parameters used in the code below:
vars<-names(data)[3:4] # names of variables to be transformed
# number of variables to be transformed; !!! in real life I'll have to deal with up to 50-60 variables, not 2.
nr.vars<-length(vars)
group.var<-names(data)[1] # name of the grouping variable
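# names of subgroups; !!! in real life I'll have up to 20-25 subgroups, not 2.
# unique() is used rather than levels() so this also works when the column is character, not factor
subgroups<-unique(as.character(data[[group.var]]))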
# For EACH subgroup: indexing variables a and b to their maximum in that subgroup;
# Further, I'll have to use these indexed variables to build the new ones:
for(i in vars){
new.name<-paste(i,".ind.to.max",sep="")
data[[new.name]]<-NA
}
# variables indexed to subgroup max
indexed.vars<-names(data)[grep("ind.to.max$", names(data))]
for(subgroup in subgroups){
  rows<-data[[group.var]] %in% subgroup # rows belonging to this subgroup
  data[rows,indexed.vars]<-lapply(data[rows,vars],function(x) x/max(x))
}
data
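For comparison, here is a minimal sketch of the same "index to subgroup maximum" step using base R's ave(), which applies a function within each level of a grouping vector and returns the results in the original row order. The toy data frame below is made up for illustration and is not the test data set above.

```r
# Toy data with a similar layout (assumed for illustration only)
toy <- data.frame(group = rep(c("first", "second"), each = 3),
                  a = c(2, 4, 8, 5, 10, 20))

# Divide each value by the maximum of its own subgroup
toy$a.ind.to.max <- ave(toy$a, toy$group, FUN = function(x) x / max(x))
toy
```

With several variables, the same call could be wrapped in a loop over vars, one ave() call per variable.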
############# DATA TRANSFORMATION #########################################
# Objective: Create new variables based on the old ones (a and b ind.to.max)
# For each new variable, the value in a given row is a function of
# (a) 2 constants (that have several levels each),
# (b) the corresponding value of the original variable (e.g., "a.ind.to.max"),
# (c) the value in the previous row of the same new variable
# PLUS: it has to be done by subgroup (variable "group")
# constant 1 used for transformation - has 3 levels; !!! in real life it will have up to 7 levels
constant1<-c(1:3)
# constant 2 used for transformation - has 3 levels; !!! in real life it will have up to 7 levels
constant2<-seq(.15,.45,.15)
# CODE THAT IS TOO SLOW (it uses parameters specified in the previous code section):
start1<-Sys.time()
for(var in indexed.vars){ # looping through variables
  for(c1 in 1:length(constant1)){ # looping through levels of constant1
    for(c2 in 1:length(constant2)){ # looping through levels of constant2
      d=log(0.5)/constant1[c1]
      l=-log(1-constant2[c2])
      name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..transf",sep=".")
      data[[name]]<-NA
      for(subgroup in subgroups){ # looping through subgroups
        # this is just the very first row of each subgroup
        data[data[[group.var]] %in% subgroup, name][1] =
          1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, var][1]*l*10)))
        # looping through the remaining rows of the subgroup
        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
          data[data[[group.var]] %in% subgroup, name][case]=
            1-((1-data[data[[group.var]] %in% subgroup, name][case-1]*exp(1)^d)/
               (exp(1)^(data[data[[group.var]] %in% subgroup, var][case]*l*10)))
        }
      }
    }
  }
}
end1<-Sys.time()
print(end1-start1) # Takes me ~0.53 secs
names(data)
data
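To make the recursion in the block above easier to see, here is a minimal sketch that writes it as a standalone function on a plain numeric vector and applies it per subgroup with ave(). The helper name transf.one, its arguments, the toy data and the output column name are all assumptions made for illustration. The sketch computes each subgroup on a vector instead of subsetting the data frame for every row, which is probably where much of the time goes, though I have not timed it.

```r
# Hypothetical helper: the name transf.one and its arguments are assumptions,
# not part of the original code.
# x  : one indexed variable for a single subgroup, in week order
# c1 : one level of constant1; c2 : one level of constant2
transf.one <- function(x, c1, c2) {
  d <- log(0.5) / c1
  l <- -log(1 - c2)
  out <- numeric(length(x))
  prev <- 0  # the first row uses 0 in place of a previous value
  for (i in seq_along(x)) {
    out[i] <- 1 - (1 - prev * exp(d)) / exp(x[i] * l * 10)
    prev <- out[i]
  }
  out
}

# Toy data (assumed), rows already sorted by week within each group
toy <- data.frame(group = rep(c("first", "second"), each = 3),
                  a.ind.to.max = c(0.5, 1, 0.25, 1, 0.2, 0.6))

# ave() runs the helper once per subgroup and keeps the original row order
toy$a.1.15 <- ave(toy$a.ind.to.max, toy$group,
                  FUN = function(x) transf.one(x, c1 = 1, c2 = 0.15))
toy
```

The loops over variables and over the levels of the two constants would stay as they are; only the two innermost loops (subgroups and rows) are replaced by the ave() call.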
--
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com