[R] improving/speeding up a very large, slow simulation
David Winsemius
dwinsemius at comcast.net
Thu Feb 14 01:25:44 CET 2013
On Feb 13, 2013, at 3:33 PM, Benjamin Caldwell wrote:
> Joshua,
>
> Thanks for the example, and the explanation. Nice article that you wrote -
> the figures alone deserve a deeper dive, I think.
>
> As a side note, I found out about profiling from another post and tried it - it
> appears that my ignorance of how slow data frames can be was the main
> problem. After turning the dfs into lists and matrices, the improvement was
> huge! Wish I had known about this issue ahead of time - it must be
> somewhere in the R help books in big red letters, but if it isn't it should
> be. Most of the discussions about speeding up code seem to talk about
> vectorization and avoiding fragmentation.
I don't know if that information is in the unnamed books you read, but it is no surprise to regular readers of R-help that `[<-.data.frame` is a bottleneck. It requires a lot of extra overhead to support the possibility of having different data types in each column and to maintain row numbering correctly.
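A minimal sketch of the difference (timings will vary by machine and R version):

```r
# Compare element-wise assignment into a data.frame vs. a matrix.
n  <- 5000
df <- data.frame(x = numeric(n))
m  <- matrix(0, nrow = n, ncol = 1)

t.df <- system.time(for (i in 1:n) df[i, 1] <- i)["elapsed"]
t.m  <- system.time(for (i in 1:n) m[i, 1]  <- i)["elapsed"]

# The matrix writes avoid the S3 dispatch and attribute bookkeeping
# that `[<-.data.frame` repeats on every single call.
c(data.frame = t.df, matrix = t.m)
```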
--
David.
>
> The below was for five iterations of the simulation, and I imagine was the
> reason it was hanging up and stalling out for large simulations:
>
>
> Original:
>
> $by.total
> total.time total.pct self.time self.pct
> simulation.unpaired.c 2857.54 99.97 111.90 3.91
> [<- 2555.74 89.41 2.78 0.10
> [<-.data.frame 2552.96 89.32 2543.84 89.00
> sequential.unpaired.c 149.48 5.23 2.66 0.09
> unpaired.test.c 92.30 3.23 18.80 0.66
> data.frame 70.72 2.47 9.68 0.34
> as.data.frame 42.36 1.48 3.74 0.13
>
>
> only using dataframe as the output:
>
> $by.total
> total.time total.pct self.time self.pct
> simulation.unpaired.c 109.14 100.00 0.90 0.82
> sequential.unpaired.c 92.16 84.44 3.28 3.01
> unpaired.test.c 84.48 77.41 21.08 19.31
> sd 43.68 40.02 5.42 4.97
>
>
> Thanks again folks.
>
>
> On Tue, Feb 12, 2013 at 9:45 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>
>> Hi Ben,
>>
>> That is correct about vectorizing --- do some speed tests to see the
>> effects. They can be quite dramatic (I have easily seen 300-fold
>> speedups from doing that). The apply functions are essentially
>> loops --- they tend to be about the same speed as loops, just with
>> slightly less code.
>>
>> The compiler package is nice, but it will not help already vectorized
>> code (because vectorized code really just means code that is linked to
>> C code that uses a compiled loop, and C is much faster than R).
>>
>> Sure, check out the bootstrapping (midway down) section on this page I
>> wrote: http://www.ats.ucla.edu/stat/r/dae/melogit.htm
>>
>> Cheers,
>>
>> Josh
>>
>> On Tue, Feb 12, 2013 at 12:57 PM, Benjamin Caldwell
>> <btcaldwell at berkeley.edu> wrote:
>>> Joshua,
>>>
>>> So, to your first suggestion - try to figure out whether some operation
>>> performed on every element of the vector at once could do the same
>>> thing - the answer is yes! Where can I see examples of that implemented?
>>>
>>> e.g. would it just be
>>>
>>> a <- 1:10000
>>> b <- 1:10000
>>> c <- 1:10000
>>> d <- data.frame(b,c)
>>>
>>> vectorize <- function(a, d) {
>>>   f <- d[, 1] + d[, 2] + a
>>>   f
>>> }
>>>
>>> out <- vectorize(a = a, d = d)
>>> out
>>>
>>> vs
>>>
>>> loop <- function(a, d) {
>>>   for (i in 1:length(d[, 1])) {
>>>     a[i] <- d[i, 1] + d[i, 2] + a[i]
>>>   }
>>>   a
>>> }
>>>
>>> out.l <- loop(a, d)
>>> out.l
>>>
>>> If so, it's vectorization that's the big advantage, not something
>>> mysterious that's happening under the hood with the apply functions?
>>>
>>> To your second : Compiler - thanks for the tip, that's great to know!
>>>
>>> To your last, could you please give an example or point me to one that
>>> was useful for you? I don't understand that well enough to use it.
>>>
>>> Thanks!
>>>
>>> Ben Caldwell
>>>
>>> Graduate Fellow
>>> University of California, Berkeley
>>> 130 Mulford Hall #3114
>>> Berkeley, CA 94720
>>> Office 223 Mulford Hall
>>> (510)859-3358
>>>
>>>
>>> On Mon, Feb 11, 2013 at 10:10 PM, Joshua Wiley <jwiley.psych at gmail.com>
>>> wrote:
>>>>
>>>> Hi Ben,
>>>>
>>>> I appreciate that the example is reproducible, and I understand why
>>>> you shared the entire project. At the same time, that is an awful lot
>>>> of code for volunteers to go through and try to optimize.
>>>>
>>>> I would try to think of the structure of your task, see what the
>>>> individual pieces are---figure out what can be reused and optimized.
>>>> Look at loops and try to think whether some operation performed on
>>>> every element of the vector at once could do the same thing.
>>>> Sometimes the next iteration of a loop depends on the previous, so it
>>>> is difficult to vectorize, but often that is not the case.
>>>>
>>>> Make use of the compiler package, especially cmpfun(). It can give
>>>> fairly substantial speedups, particularly with for loops in R.
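>>>> A minimal cmpfun() sketch (note that since R 3.4, functions are
>>>> byte-compiled on first use by default, so the gap is smaller on
>>>> recent versions of R):

```r
library(compiler)

# A deliberately loop-heavy function.
slow.center <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i] - mean(x)
  out
}
fast.center <- cmpfun(slow.center)  # byte-compiled version of the same function

x <- rnorm(1e4)
# Identical results; system.time() on each shows the compiled speedup.
stopifnot(isTRUE(all.equal(slow.center(x), fast.center(x))))
```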
>>>>
>>>> Finally if you want 1000 reps, do you actually need to repeat all the
>>>> steps 1000 times, or could you just simulate 1000x the data? If the
>>>> latter, do the steps once but make more data. If the former, you
>>>> ought to be able to parallelize it fairly easily, see the parallel
>>>> package.
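>>>> A minimal parLapply() sketch (one.rep() here is a hypothetical
>>>> stand-in for a single replicate of your simulation):

```r
library(parallel)

# Each replicate returns one summary statistic.
one.rep <- function(i) mean(rnorm(30, 100, 10))

cl <- makeCluster(2)               # in practice, detectCores() - 1
res <- parLapply(cl, 1:100, one.rep)
stopCluster(cl)

length(res)                        # one result per replicate
```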
>>>>
>>>> Good luck,
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Mon, Feb 11, 2013 at 9:22 PM, Benjamin Caldwell
>>>> <btcaldwell at berkeley.edu> wrote:
>>>>> Dear R help;
>>>>>
>>>>> I'll preface this by saying that the example I've provided below is
>>>>> pretty long, turgid, and otherwise a deep dive into a series of
>>>>> functions I wrote for a simulation study. It is, however, reproducible
>>>>> and self-contained.
>>>>>
>>>>> I'm trying to do my first simulation study that's quite big, and so
>>>>> I'll say that the output of this simulation as I'd like it to be is
>>>>> 9.375 million rows (9375 combinations of the inputs * 1000). So, first
>>>>> thing, maybe I should try to cut down on the number of dimensions and
>>>>> do it a piece at a time. If that's a conclusion people come to here,
>>>>> I'll look into paring down/doing it a chunk at a time.
>>>>>
>>>>> I'm hoping, though, that this is an opportunity for me to learn some
>>>>> more efficient, elegant ways to a. vectorize and b. use the apply
>>>>> functions, which still seem to elude me when it comes to their use
>>>>> with more complex, custom functions that have multiple inputs.
>>>>>
>>>>> If having looked the below over you can give me suggestions to help
>>>>> make this and future code I write run more quickly/efficiently, I will
>>>>> be most grateful, as currently written it would take what looks like
>>>>> days to run a thousand simulations of each possible combination of
>>>>> variables of interest.
>>>>>
>>>>> Best
>>>>>
>>>>> Ben Caldwell
>>>>>
>>>>> -----------------------------------------------
>>>>> #unpaired
>>>>> verification.plots<-rnorm(30, 100, 10)
>>>>> # writeClipboard(as.character(verification.plots))
>>>>>
>>>>> unpaired.test<- function(verification.plots, project.n, project.mean,
>>>>> project.sd, allowed.deviation, project.acres, alpha=.05){
>>>>>
>>>>> verification.plots <-as.numeric(as.character(verification.plots))
>>>>> a <- qnorm(alpha/2)
>>>>> d <- alpha*project.mean
>>>>>
>>>>> # verifier plot number
>>>>>
>>>>> n<-length(verification.plots)
>>>>> verifier.plot.number <- c(1:n)
>>>>>
>>>>>
>>>>> #all plots (verifier and project)
>>>>>
>>>>> all.plots.n <- rep(0,n)
>>>>> for(i in 1:n){
>>>>> all.plots.n[i] <- project.n + verifier.plot.number[i]
>>>>> }
>>>>>
>>>>> #running mean
>>>>> X_bar <- rep(1,n)
>>>>>
>>>>> for(i in 1:n){
>>>>> X_bar[i]<- mean(verification.plots[1:i])
>>>>> }
>>>>> #running sd
>>>>> sd2 <- NULL
>>>>>
>>>>> for(i in 2:n){
>>>>> sd2[i]<-sd(verification.plots[1:i])
>>>>> }
>>>>> #Tn
>>>>> Tn<-NULL
>>>>>
>>>>> for(i in 2:n){
>>>>> Tn[i] <- project.mean-X_bar[i]
>>>>> }
>>>>> #n_Threshold
>>>>> n_thresh<-NULL
>>>>>
>>>>> for(i in 2:n){
>>>>> n_thresh[i] <- (((a^2)/(d^2))*((project.sd^2)+(sd2[i]^2)))
>>>>> }
>>>>> #Upper
>>>>> upper <- NULL
>>>>> for (i in 2:n){
>>>>> upper[i] <- Tn[i]-d
>>>>> }
>>>>> #Lower
>>>>> lower<- NULL
>>>>> for(i in 2:n){
>>>>> lower[i] <- Tn[i]+d
>>>>> }
>>>>>
>>>>> #Success.fail
>>>>> success.fail <- NULL
>>>>>
>>>>>
>>>>>
>>>>> success.fail.num <- rep(0,n)
>>>>>
>>>>>
>>>>> if(abs(verification.plots[1]-project.mean)<(allowed.deviation*project.mean)){
>>>>> success.fail[1] <- "Pass"
>>>>> success.fail.num[1] <- 1
>>>>> } else {
>>>>> success.fail[1] <- "Fail"
>>>>> success.fail.num[1] <-0
>>>>> }
>>>>> for(i in 2:n){
>>>>> if(all.plots.n[i]>=n_thresh[i]){
>>>>> if(upper[i] <= 0){
>>>>> if(lower[i] >= 0){
>>>>> success.fail[i]<-"Pass"
>>>>> success.fail.num[i]<-1
>>>>> } else {
>>>>> success.fail[i] <- "Fail"
>>>>> success.fail.num[i] <-0}
>>>>> } else {
>>>>> success.fail[i] <- "Fail"
>>>>> success.fail.num[i] <- 0
>>>>> }
>>>>>
>>>>> } else {
>>>>> success.fail[i] <- "Inconclusive"
>>>>> success.fail.num[i] <- 0
>>>>> }
>>>>> }
>>>>>
>>>>> out.unpaired.test <- data.frame(success.fail, success.fail.num)
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> out.unpaired.test <- unpaired.test(verification.plots=verification.plots,
>>>>> project.n=30, project.mean=100, project.sd=10, allowed.deviation=.05,
>>>>> project.acres=99)
>>>>> out.unpaired.test
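>>>>> A sketch of how the running-statistics loops in unpaired.test() above
>>>>> could be vectorized (project.mean is fixed at 100 purely for
>>>>> illustration):

```r
# Running mean via cumsum, running sd via vapply (still a loop internally,
# but concise and allocation-safe), Tn as plain vector arithmetic.
v <- rnorm(30, 100, 10)
n <- length(v)

X_bar <- cumsum(v) / seq_len(n)                          # running mean, no loop
sd2   <- c(NA, vapply(2:n, function(i) sd(v[1:i]), 0))   # running sd
Tn    <- 100 - X_bar                                     # project.mean - X_bar

# The final running values agree with the full-sample statistics.
stopifnot(isTRUE(all.equal(X_bar[n], mean(v))))
stopifnot(isTRUE(all.equal(sd2[n], sd(v))))
```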
>>>>> #
>>>>> sequential.unpaired<- function(number.strata, project.n, project.mean,
>>>>> project.sd, verification.plots,allowed.deviation, project.acres,
>>>>> min.plots="default", alpha=.05){
>>>>>
>>>>> if(min.plots=="default"){
>>>>> min.plots<-NULL # minimum passing
>>>>> if(number.strata == 1 & project.acres <= 100){
>>>>> min.plots = 8
>>>>> } else if(number.strata == 1 & project.acres > 100 & project.acres <
>>>>> 500){
>>>>> min.plots = 12
>>>>> } else if(number.strata == 1 & project.acres > 500 & project.acres <
>>>>> 5000){
>>>>> min.plots = 16
>>>>> } else if(number.strata == 1 & project.acres > 5000 & project.acres <
>>>>> 10000){
>>>>> min.plots = 20
>>>>> } else if(number.strata == 1 & project.acres > 10000){
>>>>> min.plots = 24
>>>>> } else if(number.strata == 2 & project.acres < 100){
>>>>> min.plots = 4
>>>>> } else if(number.strata == 2 & project.acres > 100 & project.acres <
>>>>> 500){
>>>>> min.plots = 6
>>>>> } else if(number.strata == 2 & project.acres > 501 & project.acres <
>>>>> 5000){
>>>>> min.plots = 8
>>>>> } else if(number.strata == 2 & project.acres > 5001 & project.acres <
>>>>> 10000){
>>>>> min.plots = 10
>>>>> } else if(number.strata == 2 & project.acres > 10000){
>>>>> min.plots = 12
>>>>> } else if(number.strata == 3 & project.acres < 100){
>>>>> min.plots = 2
>>>>> } else if(number.strata == 3 & project.acres > 100 & project.acres <
>>>>> 500){
>>>>> min.plots = 3
>>>>> } else if(number.strata == 3 & project.acres > 501 & project.acres <
>>>>> 5000){
>>>>> min.plots = 4
>>>>> } else if(number.strata == 3 & project.acres > 5001 & project.acres <
>>>>> 10000){
>>>>> min.plots = 5
>>>>> } else if(number.strata == 3 & project.acres > 10000){
>>>>> min.plots = 6
>>>>> } else {min.plots=NULL}
>>>>> } else {min.plots <- min.plots}
>>>>>
>>>>>
>>>>> if(number.strata > 3) stop("max three strata")
>>>>>
>>>>>
>>>>> p <- length(verification.plots)
>>>>> mp <- min.plots
>>>>> run <- unpaired.test(verification.plots, project.n, project.mean,
>>>>> project.sd,
>>>>> allowed.deviation, project.acres, alpha=.05)
>>>>> number.passing.plots <- function(verification.plots, success.fail.num){
>>>>> n<-length(verification.plots)
>>>>> number.passing<-rep(0,n)
>>>>> number.passing[1]<-ifelse(sum(success.fail.num[1]> mp),1,0)
>>>>> for(i in 2:n){
>>>>> if(success.fail.num[i] == 1){
>>>>> v <- i-1
>>>>> number.passing[i]<-number.passing[v]+1
>>>>> } else{number.passing[i] <- 0}
>>>>> }
>>>>> number.passing
>>>>> }
>>>>>
>>>>>
>>>>> number.passing <- number.passing.plots(verification.plots=verification.plots,
>>>>> success.fail.num=run$success.fail.num)
>>>>> sucessful.unsucessful<-rep("blank",p)
>>>>> one.zero <- rep(0,p)
>>>>> result <- "blank"
>>>>> resultL <- 0
>>>>> n <-length(verification.plots)
>>>>> for(i in 1:n){
>>>>> if(number.passing[i] >= mp) {
>>>>> sucessful.unsucessful[i] <- "successful"
>>>>> one.zero[i] <- 1
>>>>> } else {
>>>>> sucessful.unsucessful[i]<- "unsuccessful"
>>>>> one.zero[i] <- 0
>>>>> }
>>>>> }
>>>>>
>>>>> if(sum(one.zero) > 0 ) {
>>>>> result <- "successful"
>>>>> resultL <- 1
>>>>> }
>>>>>
>>>>> plots.success <- function(one.zero){
>>>>>
>>>>> plots.to.success<-NULL
>>>>>
>>>>> for(i in 1:n){
>>>>> if(sum(one.zero)==0){plots.to.success= NA
>>>>> } else if(one.zero[i]==1){plots.to.success<-append(plots.to.success,
>>>>> i)}
>>>>> }
>>>>> plots.to.success<-min(plots.to.success)
>>>>> plots.to.success
>>>>> }
>>>>>
>>>>> plots.to.success <- plots.success(one.zero=one.zero)
>>>>>
>>>>> complete <- data.frame(min.plots, number.passing,
>>>>> sucessful.unsucessful, one.zero)
>>>>> collapsed <- data.frame(result, resultL, plots.to.success)
>>>>> out <- c(as.list(complete),as.list(collapsed))
>>>>> }
>>>>>
>>>>>
>>>>> unpaired.out<-sequential.unpaired(number.strata=2,
>>>>> verification.plots=verification.plots, project.n=15, project.mean=100,
>>>>> project.sd=10, allowed.deviation=.05, project.acres=99,
>>>>> min.plots="default")
>>>>>
>>>>> str(unpaired.out)
>>>>> #unpaired.out[[7]][1]
>>>>>
>>>>> ##
>>>>>
>>>>> simulation.unpaired <- function(reps, project.acreage.range = c(99,
>>>>> 110,510,5100,11000), project.mean, project.n.min, project.n.max,
>>>>> project.sd.min, project.sd.max, verification.mean.min,
>>>>> verification.mean.max, number.verification.plots.min,
>>>>> number.verification.plots.max, allowed.deviation=.1, alpha=.05,
>>>>> length.out = 5, min.plots="default") {
>>>>>
>>>>> number.strata.range<-c(1:3) # set by CAR
>>>>>
>>>>> project.n.range <- seq(project.n.min, project.n.max,
>>>>> length.out=length.out)
>>>>>
>>>>> project.sd.range <- seq(project.sd.min, project.sd.max,
>>>>> length.out=length.out) # assume verification sd is the same as the
>>>>> # project sd to simplify - seems a reasonable simplification
>>>>>
>>>>> number.verification.plots<- seq(number.verification.plots.min,
>>>>> number.verification.plots.max, length.out=length.out)
>>>>>
>>>>> verification.range <- seq(verification.mean.min, verification.mean.max,
>>>>> length.out=length.out)
>>>>>
>>>>> permut.grid <- expand.grid(number.strata.range, project.n.range,
>>>>> project.acreage.range, project.mean, project.sd.range,
>>>>> number.verification.plots, verification.range, allowed.deviation)
>>>>> # create a matrix with all combinations of the supplied vectors
>>>>>
>>>>> # assign names to the columns of the grid of combinations
>>>>> names.permut<-c("number.strata", "project.n.plots", "project.acreage",
>>>>> "project.mean", "project.sd", "number.verification.plots",
>>>>> "verification.mean", "allowed.deviation")
>>>>>
>>>>> names(permut.grid)<-names.permut # done
>>>>>
>>>>> combinations<-length(permut.grid[,1])
>>>>>
>>>>> size <- reps*combinations # need to know the size of the master
>>>>> # matrix, which is the number of replications * each combination of
>>>>> # the supplied factors
>>>>>
>>>>> # we want a df from which to read all the data into the simulation,
>>>>> # and record the results
>>>>> permut.col<-ncol(permut.grid)
>>>>> col.base<-ncol(permut.grid)+2
>>>>> base <- matrix(nrow=size, ncol=col.base)
>>>>> base <-data.frame(base)
>>>>>
>>>>> # supply the names
>>>>> names.base<-c("number.strata", "project.n.plots", "project.acreage",
>>>>> "project.mean", "project.sd", "number.verification.plots",
>>>>> "verification.mean", "allowed.deviation","success.fail",
>>>>> "plots.to.success")
>>>>>
>>>>> names(base)<-names.base
>>>>>
>>>>> # need to create index vectors for the base, master df
>>>>> ends <- seq(reps+1, size+1, by=reps)
>>>>> begins <- ends-reps
>>>>> index <- cbind(begins, ends-1)
>>>>> #done
>>>>>
>>>>> # next, need to assign, to the first permut.col columns and a number
>>>>> # of rows equal to the number of reps in the simulation, the values of
>>>>> # the given row in the permut.grid matrix
>>>>>
>>>>> pb <- winProgressBar(title="Create base, big loop 1 of 2",
>>>>> label="0% done", min=0, max=100, initial=0)
>>>>>
>>>>> for (i in 1:combinations) {
>>>>>
>>>>> base[index[i,1]:index[i,2],1:permut.col] <- permut.grid[i,]
>>>>> #progress bar
>>>>> info <- sprintf("%d%% done", round((i/combinations)*100))
>>>>> setWinProgressBar(pb, (i/combinations)*100, label=info)
>>>>> }
>>>>>
>>>>> close(pb)
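>>>>> A sketch of how that first loop could be replaced by a single
>>>>> rep()-based indexing step (toy permut.grid with hypothetical column
>>>>> names, standing in for the real grid):

```r
# Loop 1 only replicates each row of permut.grid `reps` times; indexing
# with rep(..., each=) does the same without any `[<-.data.frame` calls.
permut.grid <- expand.grid(strata = 1:3, acres = c(99, 510))
reps <- 4

base2 <- permut.grid[rep(seq_len(nrow(permut.grid)), each = reps), ]

stopifnot(nrow(base2) == reps * nrow(permut.grid))
```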
>>>>>
>>>>> # now, simply feed the values, replicated the number of times we want
>>>>> # to run the simulation, into the sequential.unpaired function, and
>>>>> # assign the values to the appropriate columns
>>>>>
>>>>> out.index1<-ncol(permut.grid)+1
>>>>> out.index2<-ncol(permut.grid)+2
>>>>>
>>>>> #progress bar
>>>>> pb <- winProgressBar(title="fill values, big loop 2 of 2",
>>>>> label="0% done", min=0, max=100, initial=0)
>>>>>
>>>>> for (i in 1:size){
>>>>>
>>>>> scalar.base <- base[i,]
>>>>> verification.plots <- rnorm(scalar.base$number.verification.plots,
>>>>> scalar.base$verification.mean, scalar.base$project.sd)
>>>>> result <- sequential.unpaired(scalar.base$number.strata,
>>>>> scalar.base$project.n.plots, scalar.base$project.mean,
>>>>> scalar.base$project.sd, verification.plots,
>>>>> scalar.base$allowed.deviation, scalar.base$project.acreage,
>>>>> min.plots='default', alpha)
>>>>>
>>>>> base[i,out.index1] <- result[[6]][1]
>>>>> base[i,out.index2] <- result[[7]][1]
>>>>> info <- sprintf("%d%% done", round((i/size)*100))
>>>>> setWinProgressBar(pb, (i/size)*100, label=info)
>>>>> }
>>>>>
>>>>> close(pb)
>>>>> #return the results
>>>>> return(base)
>>>>>
>>>>> }
>>>>>
>>>>> # I would like reps to = 1000, but that takes a really long time right
>>>>> now
>>>>> test5 <- simulation.unpaired(reps=5, project.acreage.range = c(99,
>>>>> 110,510,5100,11000), project.mean=100, project.n.min=10,
>>>>> project.n.max=100,
>>>>> project.sd.min=.01, project.sd.max=.2, verification.mean.min=100,
>>>>> verification.mean.max=130, number.verification.plots.min=10,
>>>>> number.verification.plots.max=50, length.out = 5)
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>>> --
>>>> Joshua Wiley
>>>> Ph.D. Student, Health Psychology
>>>> Programmer Analyst II, Statistical Consulting Group
>>>> University of California, Los Angeles
>>>> https://joshuawiley.com/
>>>
>>>
>>
>>
>>
>>
>
David Winsemius
Alameda, CA, USA