[R] performance of do.call("rbind")
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Tue Jun 28 05:34:53 CEST 2016
Sarah, you make it sound as though everyone should be using matrices, even
though they have distinct disadvantages for many types of analysis.
You are right that rbind on data frames is slow, but dplyr::bind_rows
handles data frames almost as fast as your rbind-ing matrices solution.
And if you apply knowledge of your data frames and don't do the error
checking that bind_rows does, you can beat both of them without converting
to matrices, as the "tm.dfcolcat" solution below illustrates. (Not for
everyday use, but if you have a big job and the data are clean this may
make a difference.)
Data frames, handled properly, are only slightly slower than matrices for
most purposes. I have seen numerical solutions of partial differential
equations run lightning fast using pre-allocated data frames and vector
calculations, so even traditional "matrix" calculation domains don't have
to use matrices to be competitive.
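To make that concrete, here is a minimal sketch of the pre-allocate-and-
vectorize pattern (a made-up example, not part of the benchmark below;
"nsteps", "dt", and "state" are invented names): allocate the full-length
columns once, then assign whole columns with vectorized expressions instead
of growing the frame inside a loop.

# minimal sketch: pre-allocate once, then whole-column vectorized assignment
nsteps <- 1e5
state <- data.frame( t = numeric( nsteps )
                   , u = numeric( nsteps )
                   )
dt <- 0.01
state$t <- seq( 0, by = dt, length.out = nsteps )
state$u <- exp( -state$t )   # one vectorized assignment, no row-by-row loop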
######################
testsize <- 5000   # number of small data frames to combine
N <- 20            # timing repetitions per method
set.seed(1234)
# a list of 5000 data frames, each 100 rows x 3 numeric columns
testdf.list <- lapply( seq_len( testsize )
                     , function( x ) {
                         data.frame( matrix( runif( 300 ), nrow=100 ) )
                       }
                     )
tm.rbind <- function( x = 0 ) {
  system.time( r.df <- do.call( "rbind", testdf.list ) )
}
# toss the first one
tm.rbind()
tms.rbind <- data.frame( do.call( rbind
                                , lapply( 1:N, tm.rbind )
                                )
                       , which = "rbind"
                       )
tm.rbindm <- function( x = 0 ) {
  system.time({
    testm.list <- lapply( testdf.list, as.matrix )
    r.m <- do.call( rbind, testm.list )
  })
}
# toss the first one
tm.rbindm()
tms.rbindm <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.rbindm )
                                 )
                        , which = "rbindm"
                        )
tm.dfcopy <- function( x = 0 ) {
  system.time({
    # pre-allocate the full-size data frame, then copy 100-row blocks into it
    l.df <- data.frame( matrix( NA
                              , nrow = 100 * testsize
                              , ncol = 3
                              )
                      )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.df[ start:end, ] <- testdf.list[[ i ]]
    }
  })
}
# toss the first one
tm.dfcopy()
tms.dfcopy <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.dfcopy )
                                 )
                        , which = "dfcopy"
                        )
tm.dfmatcopy <- function( x = 0 ) {
  system.time({
    # same pre-allocated data frame target, but the sources are converted
    # to matrices first
    l.m <- data.frame( matrix( NA
                             , nrow = 100 * testsize
                             , ncol = 3
                             )
                     )
    testm.list <- lapply( testdf.list, as.matrix )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.m[ start:end, ] <- testm.list[[ i ]]
    }
  })
}
# toss the first one
tm.dfmatcopy()
tms.dfmatcopy <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.dfmatcopy )
                                    )
                           , which = "dfmatcopy"
                           )
tm.bind_rows <- function( x = 0 ) {
  system.time({
    dplyr::bind_rows( testdf.list )
  })
}
# toss the first one
tm.bind_rows()
tms.bind_rows <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.bind_rows )
                                    )
                           , which = "bind_rows"
                           )
tm.dfcolcat <- function( x = 0 ) {
  system.time({
    # skip rbind's per-frame checking: concatenate each column across all
    # frames with c(), then assemble the full-length columns into a data
    # frame once; assumes every frame has identical column structure
    mycolnames <- names( testdf.list[[ 1 ]] )
    result <-
      setNames( data.frame( lapply( mycolnames
                                  , function( colidx ) {
                                      do.call( c
                                             , lapply( testdf.list
                                                     , function( v ) {
                                                         v[[ colidx ]]
                                                       }
                                                     )
                                             )
                                    }
                                  )
                          )
              , mycolnames
              )
  })
}
# toss the first one
tm.dfcolcat()
tms.dfcolcat <- data.frame( do.call( rbind
                                   , lapply( 1:N, tm.dfcolcat )
                                   )
                          , which = "dfcolcat"
                          )
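A quick correctness check (my addition, not part of the timing runs): the
column-concatenation result should reproduce the values rbind gives,
differing only in row names.

# optional sanity check: dfcolcat reproduces rbind's values (row names differ)
chk.rbind <- do.call( rbind, testdf.list )
chk.colcat <- setNames( data.frame( lapply( names( testdf.list[[ 1 ]] )
                                          , function( colidx ) {
                                              do.call( c
                                                     , lapply( testdf.list
                                                             , function( v ) v[[ colidx ]]
                                                             )
                                                     )
                                            }
                                          )
                                  )
                      , names( testdf.list[[ 1 ]] )
                      )
all.equal( chk.rbind, chk.colcat, check.attributes = FALSE )   # expect TRUE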
# Sarah's timings (from her message quoted below), kept for reference;
# they come from a different machine, so they are not merged into the plots
tms.sarah <- read.table( text=
"   user  system elapsed        which
 34.280   0.009  34.317     tm.rbind
  0.310   0.000   0.311    tm.rbindm
 81.890   0.069  82.162    tm.dfcopy
 67.664   0.047  68.009 tm.dfmatcopy
", header = TRUE, as.is = TRUE )
mergetms <- rbind( tms.rbind
                 , tms.rbindm
                 , tms.dfcopy
                 , tms.dfmatcopy
                 , tms.bind_rows
                 , tms.dfcolcat
                 )
mergetms$which <- factor( mergetms$which
                        , levels = c( "rbind"
                                    , "rbindm"
                                    , "dfcopy"
                                    , "dfmatcopy"
                                    , "bind_rows"
                                    , "dfcolcat"
                                    )
                        )
plot( user.self ~ which, data=mergetms )
plot( user.self ~ which, data=mergetms, ylim=c(0,4) )
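For a compact numeric companion to the boxplots (my addition):

# median elapsed time per method
aggregate( elapsed ~ which, data = mergetms, FUN = median )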
summary( tms.rbind )
#    user.self        sys.self          elapsed        user.child    sys.child
#  Min.   :18.84   Min.   :0.0000   Min.   :18.92   Min.   : NA   Min.   : NA
#  1st Qu.:20.83   1st Qu.:0.0275   1st Qu.:20.96   1st Qu.: NA   1st Qu.: NA
#  Median :22.91   Median :0.0400   Median :23.00   Median : NA   Median : NA
#  Mean   :25.06   Mean   :0.0430   Mean   :25.21   Mean   :NaN   Mean   :NaN
#  3rd Qu.:24.29   3rd Qu.:0.0600   3rd Qu.:24.39   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :39.36   Max.   :0.1000   Max.   :39.94   Max.   : NA   Max.   : NA
#                                                   NA's   :20    NA's   :20
summary( tms.rbindm )
#    user.self         sys.self    elapsed         user.child    sys.child
#  Min.   :0.2200   Min.   :0   Min.   :0.2200   Min.   : NA   Min.   : NA
#  1st Qu.:0.5600   1st Qu.:0   1st Qu.:0.5800   1st Qu.: NA   1st Qu.: NA
#  Median :0.5850   Median :0   Median :0.5900   Median : NA   Median : NA
#  Mean   :0.5465   Mean   :0   Mean   :0.5555   Mean   :NaN   Mean   :NaN
#  3rd Qu.:0.5900   3rd Qu.:0   3rd Qu.:0.5925   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :0.6100   Max.   :0   Max.   :0.6100   Max.   : NA   Max.   : NA
#                                                NA's   :20    NA's   :20
summary( tms.dfcopy )
#    user.self       sys.self          elapsed        user.child    sys.child
#  Min.   :114.2   Min.   :0.0000   Min.   :114.3   Min.   : NA   Min.   : NA
#  1st Qu.:122.7   1st Qu.:0.0000   1st Qu.:123.0   1st Qu.: NA   1st Qu.: NA
#  Median :128.3   Median :0.0050   Median :128.4   Median : NA   Median : NA
#  Mean   :134.5   Mean   :0.0185   Mean   :134.8   Mean   :NaN   Mean   :NaN
#  3rd Qu.:134.7   3rd Qu.:0.0325   3rd Qu.:134.8   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :261.5   Max.   :0.0800   Max.   :263.4   Max.   : NA   Max.   : NA
#                                                   NA's   :20    NA's   :20
summary( tms.dfmatcopy )
#    user.self         sys.self          elapsed        user.child    sys.child
#  Min.   : 98.15   Min.   : 0.050   Min.   :102.0   Min.   : NA   Min.   : NA
#  1st Qu.:136.47   1st Qu.: 3.495   1st Qu.:144.6   1st Qu.: NA   1st Qu.: NA
#  Median :147.53   Median : 7.135   Median :158.3   Median : NA   Median : NA
#  Mean   :177.10   Mean   : 7.030   Mean   :185.2   Mean   :NaN   Mean   :NaN
#  3rd Qu.:159.12   3rd Qu.:10.932   3rd Qu.:166.9   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :362.95   Max.   :16.100   Max.   :364.3   Max.   : NA   Max.   : NA
#                                                    NA's   :20    NA's   :20
summary( tms.bind_rows )
#    user.self         sys.self    elapsed         user.child    sys.child
#  Min.   :0.8200   Min.   :0   Min.   :0.8200   Min.   : NA   Min.   : NA
#  1st Qu.:0.8300   1st Qu.:0   1st Qu.:0.8375   1st Qu.: NA   1st Qu.: NA
#  Median :0.8400   Median :0   Median :0.8400   Median : NA   Median : NA
#  Mean   :0.8460   Mean   :0   Mean   :0.8480   Mean   :NaN   Mean   :NaN
#  3rd Qu.:0.8525   3rd Qu.:0   3rd Qu.:0.8525   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :0.9400   Max.   :0   Max.   :0.9900   Max.   : NA   Max.   : NA
#                                                NA's   :20    NA's   :20
summary( tms.dfcolcat )
#    user.self        sys.self    elapsed        user.child    sys.child
#  Min.   :0.340   Min.   :0   Min.   :0.340   Min.   : NA   Min.   : NA
#  1st Qu.:0.350   1st Qu.:0   1st Qu.:0.350   1st Qu.: NA   1st Qu.: NA
#  Median :0.360   Median :0   Median :0.360   Median : NA   Median : NA
#  Mean   :0.358   Mean   :0   Mean   :0.357   Mean   :NaN   Mean   :NaN
#  3rd Qu.:0.360   3rd Qu.:0   3rd Qu.:0.360   3rd Qu.: NA   3rd Qu.: NA
#  Max.   :0.380   Max.   :0   Max.   :0.380   Max.   : NA   Max.   : NA
#                                              NA's   :20    NA's   :20
######################
On Mon, 27 Jun 2016, Sarah Goslee wrote:
> That's not what I said, though, and it's not necessarily true. Growing
> an object within a loop _is_ a slow process, but that's not the
> problem here. The problem is using data frames instead of matrices.
> The need to manage column classes is very costly. Converting to
> matrices will almost always be enormously faster.
>
> Here's an expansion of the previous example I posted, in four parts:
> 1. do.call with data frame - very slow - 34.317 s elapsed time for
> 5000 data frames
> 2. do.call with matrix - very fast - 0.311 s elapsed
> 3. pre-allocated loop with data frame - even slower (!) - 82.162 s
> 4. pre-allocated loop filling a data frame from matrices - still slow - 68.009 s
>
> It matters whether the columns are converted to numeric or character,
> and the time doesn't scale linearly with list length. For a particular
> problem, the best solution may vary greatly (and I didn't even include
> packages beyond the base functionality). In general, though, using
> matrices is faster than using data frames, and using do.call is faster
> than using a pre-allocated loop, which is much faster than growing an
> object.
>
> Sarah
>
>> testsize <- 5000
>>
>> set.seed(1234)
>> testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
>> testdf.list <- lapply(seq_len(testsize), function(x)testdf)
>>
>> system.time(r.df <- do.call("rbind", testdf.list))
> user system elapsed
> 34.280 0.009 34.317
>>
>> system.time({
> + testm.list <- lapply(testdf.list, as.matrix)
> + r.m <- do.call("rbind", testm.list)
> + })
> user system elapsed
> 0.310 0.000 0.311
>>
>> system.time({
> + l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
> + for(i in seq_len(testsize)) {
> + start <- (i-1)*100 + 1
> + end <- i*100
> + l.df[start:end, ] <- testdf.list[[i]]
> + }
> + })
> user system elapsed
> 81.890 0.069 82.162
>>
>> system.time({
> + l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
> + testm.list <- lapply(testdf.list, as.matrix)
> + for(i in seq_len(testsize)) {
> + start <- (i-1)*100 + 1
> + end <- i*100
> + l.m[start:end, ] <- testm.list[[i]]
> + }
> + })
> user system elapsed
> 67.664 0.047 68.009
>
>
>
>
> On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
>> Hi,
>>
>> Just to add my tuppence, which might not even be worth that these days...
>>
>> I found the following blog post from 2013, which is likely dated to some extent, but provided some benchmarks for a few methods:
>>
>> http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html
>>
>> There is also a comment with a reference there to using the data.table package, which I don't use, but may be something to evaluate.
>>
>> As Bert and Sarah hinted at, there is overhead in taking the repetitive piecemeal approach.
>>
>> If all of your data frames are of the exact same column structure (column order, column types), it may be prudent to do your own pre-allocation of a data frame that is the target row total size and then "insert" each "sub" data frame by using row indexing into the target structure.
>>
>> Regards,
>>
>> Marc Schwartz
>>
>>
>>> On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>>
>>> Hi Bert,
>>>
>>> You are most likely right. I just thought that do.call("rbind", ...) is
>>> somehow more clever and allocates the memory up front. My error. After
>>> more searching I did find rbind.fill from plyr which seems to do the
>>> job (it computes the size of the result data.frame and allocates it
>>> first).
>>>
>>> best
>>>
>>> On 27 June 2016 at 18:49, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>> The following might be nonsense, as I have no understanding of R
>>>> internals; but ....
>>>>
>>>> "Growing" structures in R by iteratively adding new pieces is often
>>>> warned to be inefficient when the number of iterations is large, and
>>>> your rbind() invocation might fall under this rubric. If so, you might
>>>> try issuing the call, say, 20 times, over 10k disjoint subsets of the
>>>> list, and then rbinding up the 20 large frames.
>>>>
>>>> Again, caveat emptor.
>>>>
>>>> Cheers,
>>>> Bert
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming along
>>>> and sticking things into it."
>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>>>> I have a list (variable name data.list) with approx 200k data.frames
>>>>> with dim(data.frame) approx 100x3.
>>>>>
>>>>> a call
>>>>>
>>>>> data <-do.call("rbind", data.list)
>>>>>
>>>>> does not complete - run time is prohibitive (I killed the rsession
>>>>> after 5 minutes).
>>>>>
>>>>> I would think that merging data.frame's is a common operation. Is
>>>>> there a better function (more performant) that I could use?
>>>>>
>>>>> Thank you.
>>>>> Witold
>>>>>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
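P.S. Bert's chunked-rbind suggestion above is easy to sketch (my example,
untimed; 20 chunks is his example figure):

# sketch of Bert's two-level rbind: bind disjoint chunks, then bind the chunks
chunks <- split( testdf.list
               , cut( seq_along( testdf.list ), breaks = 20, labels = FALSE )
               )
big.pieces <- lapply( chunks, function( lst ) do.call( rbind, lst ) )
r.chunked <- do.call( rbind, big.pieces )

Whether this helps depends on where the data.frame rbind method spends its
time, so time it on your own data before relying on it.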
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k