[R] Has R recently made performance improvements in accumulation?
Thierry Onkelinx
thierry.onkelinx at inbo.be
Tue Jul 19 16:27:39 CEST 2016
Dear Brent,
I can confirm your timings with
library(microbenchmark)
microbenchmark(
  mkFrameForLoop(100, 10),
  mkFrameForLoop(200, 10),
  mkFrameForLoop(400, 10)
)
but profiling your code showed that rbind accounts for only a small
fraction of the CPU time used by the function.
profvis::profvis({mkFrameForLoop(100, 10)})
So I stripped your example down to the minimal function below. Now rbind
uses most of the CPU time, and the timings indicate an O(n^2) relation.
minimal <- function(rows, cols) {
  # start from a zero-row matrix and grow it one row per iteration
  x <- matrix(NA_integer_, ncol = cols, nrow = 0)
  for (i in seq_len(rows)) {
    x <- rbind(x, rep(i, cols))
  }
  x
}
profvis::profvis({minimal(1000, 100)})
timing <- microbenchmark(
  X50 = minimal(50, 50),
  X100 = minimal(100, 50),
  X200 = minimal(200, 50),
  X400 = minimal(400, 50),
  X800 = minimal(800, 50),
  X1600 = minimal(1600, 50)
)
timing
Unit: microseconds
  expr        min         lq        mean     median          uq        max neval cld
   X50    199.006    212.278    233.8444    235.728    247.3770    296.987   100 a
  X100    565.693    593.957    827.8733    618.835    640.1925   2950.139   100 a
  X200   1804.059   1876.390   2166.1106   1903.370   1938.7115   4263.967   100 a
  X400   6453.913   8755.848   8546.4339   8890.884   8961.7535  13259.024   100 a
  X800  30575.048  32913.186  36555.0118  33093.243  34620.5895 178740.765   100  b
 X1600 130976.429 133674.679 151494.6492 135197.087 137327.1235 292291.385   100   c
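Note how each doubling of N roughly quadruples the median time (1.9 ms,
8.9 ms, 33 ms, 135 ms for N = 200, 400, 800, 1600): the signature of
quadratic growth.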
timing$N <- as.integer(gsub("X", "", levels(timing$expr)))[timing$expr]
model <- lm(time ~ poly(N, 4), data = timing)
summary(model)
Call:
lm(formula = time ~ poly(N, 4), data = timing)

Residuals:
      Min        1Q    Median        3Q       Max
-20518162  -3378940   -130815    -45881 142183951

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   33303987     843350  39.490   <2e-16 ***
poly(N, 4)1 1286962014   20657783  62.299   <2e-16 ***
poly(N, 4)2  338770077   20657783  16.399   <2e-16 ***
poly(N, 4)3     222734   20657783   0.011    0.991
poly(N, 4)4   -2260902   20657783  -0.109    0.913
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20660000 on 595 degrees of freedom
Multiple R-squared:  0.8746,  Adjusted R-squared:  0.8738
F-statistic:  1038 on 4 and 595 DF,  p-value: < 2.2e-16
newdata <- data.frame(N = pretty(timing$N, 40))
newdata$time <- predict(model, newdata = newdata)
plot(newdata$N, newdata$time, type = "l")
plot(newdata$N, sqrt(newdata$time), type = "l")
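If the relation is quadratic, then sqrt(time) should be linear in N, and
the second plot shows essentially a straight line. So I refit the model on
the square-root scale: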
model2 <- lm(sqrt(time) ~ poly(N, 4), data = timing)
summary(model2)
Call:
lm(formula = sqrt(time) ~ poly(N, 4), data = timing)

Residuals:
   Min     1Q Median     3Q    Max
-756.3 -202.8  -54.7   -5.5 7416.5

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3980.36      33.13 120.160  < 2e-16 ***
poly(N, 4)1 100395.40     811.41 123.730  < 2e-16 ***
poly(N, 4)2   2191.34     811.41   2.701  0.00712 **
poly(N, 4)3   -803.54     811.41  -0.990  0.32243
poly(N, 4)4     82.09     811.41   0.101  0.91945
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 811.4 on 595 degrees of freedom
Multiple R-squared:  0.9626,  Adjusted R-squared:  0.9624
F-statistic:  3829 on 4 and 595 DF,  p-value: < 2.2e-16
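Only the linear term matters on this scale: sqrt(time) grows linearly in
N, so time itself grows as N^2, confirming the quadratic cost of growing a
matrix with rbind. As a complementary check (my own addition, reusing the
timing data from above), a log-log fit estimates the growth exponent
directly:

# On a log-log scale the slope of time against N estimates the exponent
# of the growth law; a value near 2 confirms O(N^2) behaviour.
model3 <- lm(log(time) ~ log(N), data = timing)
coef(model3)[["log(N)"]]  # expect a slope close to 2

For your Monte Carlo use case the usual remedy applies: preallocate the
result instead of growing it. The sketch below (my own code, not from the
article) fills each row of a preallocated matrix in place, so every row is
written exactly once and the total cost is O(n):

# Preallocate the full matrix once and assign rows in place instead of
# growing the matrix with rbind on every iteration.
minimal_prealloc <- function(rows, cols) {
  x <- matrix(NA_integer_, nrow = rows, ncol = cols)
  for (i in seq_len(rows)) {
    x[i, ] <- rep(i, cols)
  }
  x
}
microbenchmark(
  grow = minimal(1600, 50),
  prealloc = minimal_prealloc(1600, 50)
)

When the number of rows is not known in advance, collecting the pieces in
a list and calling do.call(rbind, pieces) once at the end achieves much
the same effect; the win-vector article you linked benchmarks that idiom
as well.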
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
2016-07-19 15:40 GMT+02:00 Brent via R-help <r-help at r-project.org>:
> Subtitle: or, more likely, am I benchmarking wrong?
>
> I am new to R, but I have read that R is a hive of performance pitfalls.
> A very common case is if you try to accumulate results into a typical
> immutable R data structure.
>
> Exhibit A for this confusion is this StackOverflow question on an
> algorithm for O(1) performance in appending to a list:
>
> https://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1
> The original question was asked on 2010-03-13, and responses have
> trickled in for at least five years since.
>
> Before getting to my problem, I instead offer an example from someone
> vastly better at R than I am, which I am sure should be free of mistakes.
> Consider this interesting article on efficient accumulation in R:
> http://www.win-vector.com/blog/2015/07/efficient-accumulation-in-r/
>
> In that article's first example, it claims that the "obvious “for-loop”
> solution" (accumulation into a data.frame using rbind) is:
> is incredibly slow.
> ...
> we pay the cost of processing n*(n+1)/2 rows of data
>
> In other words, what looks like an O(n) algorithm is actually O(n^2).
>
> Sounds awful. But when I tried executing their example on my machine, I
> see O(n) NOT O(n^2) behavior!
>
> Here is the complete code that I executed:
>
> mkFrameForLoop <- function(nRow, nCol) {
>   d <- c()
>   for (i in seq_len(nRow)) {
>     ri <- mkRow(nCol)
>     di <- data.frame(ri, stringsAsFactors = FALSE)
>     d <- rbind(d, di)
>   }
>   d
> }
>
> mkRow <- function(nCol) {
>   x <- as.list(rnorm(nCol))
>   # make row mixed types by changing first column to string
>   x[[1]] <- ifelse(x[[1]] > 0, 'pos', 'neg')
>   names(x) <- paste('x', seq_len(nCol), sep = '.')
>   x
> }
>
> t1 = Sys.time()
> x = mkFrameForLoop(100, 10)
> tail(x)
> t2 = Sys.time()
> t2 - t1
>
> t1 = Sys.time()
> x = mkFrameForLoop(200, 10)
> tail(x)
> t2 = Sys.time()
> t2 - t1
>
> t1 = Sys.time()
> x = mkFrameForLoop(400, 10)
> tail(x)
> t2 = Sys.time()
> t2 - t1
>
> And here is what I got for the execution times:
>
> Time difference of 0.113005876541138 secs
> Time difference of 0.205012083053589 secs
> Time difference of 0.390022993087769 secs
>
> That's a linear, not quadratic, increase in execution time as a function
> of nRow! It is NOT what I was expecting, although it was nice to see.
>
> Notes:
> --the functions above are copied and pasted from the article
> --the execution time measurement code is all that I added
> --yes, that time measurement code is crude
> --I am aware of R benchmarking frameworks, in fact, I first tried
> using some of them
> --but when I got unexpected results, I went back to the simplest
> code possible, which is the above
> --it turns out that I get the same results regardless of how I
> measure
> --each measurement doubles the number of rows (from 100 to 200 to
> 400), which is what should be increasing to bring out rbind's allegedly bad
> behavior
> --my machine has a Xeon E3 1245, 8 GB of RAM, running Win 7 Pro 64
> bit, using R 3.3.1 (latest release as of today)
>
> Given those unexpected results, you now understand the title of my post.
> So, has R's runtime recently gotten smart enough to know when it does not
> need to copy a data structure? Maybe in the case above it does a quick
> shallow copy instead of a deep copy? Or am I benchmarking wrong?
>
> What motivated my investigation is that I want to store a bunch of Monte
> Carlo simulation results in a numeric matrix. When I am finished with my
> code, I know that it will be a massive matrix (e.g. 1,000,000 or more
> rows). So I was concerned about performance, and went about benchmarking.
> Let me know if you want to see that benchmark code. I found that
> assignment to a matrix does not seem to generate a copy, even though
> assignment is a mutating operation and I had worried that it might. But
> when I investigated deeper, such as with the above code, I got worried
> that I cannot trust my results.
>
>
> I look forward to any feedback that you can give.
>
> I would especially appreciate any links that explain how you can determine
> what the R runtime actually does.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.