[R] Speed Advice for R --- avoid data frames

Frank Harrell f.harrell at vanderbilt.edu
Wed Jul 6 14:57:44 CEST 2011


On occasion, as pointed out in an earlier posting, it is efficient to convert
to a matrix and, when finished, convert back to a data frame.  The Hmisc
package's asNumericMatrix and matrix2dataFrame functions assist by
converting character variables to factors if needed, and by holding on to
the original attributes of the variables in the data frame, such as
"levels", then restoring those attributes.
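
A minimal sketch of the round trip (the data and the column name are
illustrative; see ?asNumericMatrix for the exact interface in your
version of Hmisc):

  library(Hmisc)

  d <- data.frame(age = rnorm(100, 50, 10),
                  sex = factor(sample(c("f", "m"), 100, replace = TRUE)))

  m <- asNumericMatrix(d)              # data frame -> numeric matrix;
                                       # original attributes are kept
  m[, "age"] <- sqrt(abs(m[, "age"]))  # fast matrix arithmetic
  d2 <- matrix2dataFrame(m)            # back to a data frame, with
                                       # attributes (e.g. levels) restored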

Frank


Uwe Ligges wrote:
> 
> On 02.07.2011 21:35, ivo welch wrote:
>> hi uwe, thanks for the clarification.  of course, my example should
>> always be done in vectorized form.  I only used it to show how iterative
>> access compares in the simplest possible fashion.  < 100 accesses per
>> second is REALLY slow, though.
>>
>> I don't know R internals, and the learning curve would be steep.
>> moreover, there is no guarantee that changes I would make would be
>> accepted.  so, I cannot do this.
>>
>> however, for an R expert, this should not be too difficult.
>> conceptually, if the data frame element access primitives in the code
>> are create/write/read/destroy, then it's truly trivial: just add a
>> matrix (with the same dim as the data frame) of byte pointers that
>> point at the storage, maintained at creation/change time.  this would
>> be quick-and-dirty.  out of curiosity, do you know which source file
>> has the data frame internals?  maybe I will get tempted anyway if it
>> is simple enough.
> 
> 
> I think you should start by looking at the mechanisms used to construct 
> data.frames (such as data.frame) and learn that data.frames are special 
> lists. Then you may want to look at the differences between 
> .Primitive("[") and .Primitive("[<-") used for vectors (including 
> vectors with dim attributes, such as matrices) and the corresponding 
> methods for data.frames: "[<-.data.frame" and "[.data.frame".
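> 
> A quick way to see both points from the prompt (a sketch; D stands for
> any data frame you have around):
> 
>    is.list(D)         # TRUE: a data.frame is a list of columns
>    class(D)           # "data.frame", which triggers S3 dispatch
>    `[.data.frame`     # prints the R-level method that stands in for
>    `[<-.data.frame`   # the fast primitives on every element access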
> 
> After that, I doubt you will want to improve on it further. Note also 
> that data.frames can be pretty large, and you really do not want to 
> store a matrix of pointers as large as the data.frame itself. People 
> working with large data.frames won't be happy with such a suggestion.
> 
> If you want to follow up, I'd suggest moving the thread to R-devel, 
> where it seems to be more appropriate.
> 
> Best,
> Uwe
> 
> 
>>
>> (a more efficient but more involved way to do this would be to store a
>> data frame internally as a matrix of data pointers at all times, but
>> this would probably require more surgery.)
>>
>> It is also not as important for me as it is for others: giving a good
>> impression to those who are not aware of the tradeoffs, which is most
>> people considering whether to adopt R.
>>
>> /iaw
>>
>>
>> ----
>> Ivo Welch (ivo.welch at gmail.com)
>>
>>
>>
>>
>> 2011/7/2 Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>
>>> Some comments:
>>>
>>> the comparison of matrix rows vs. matrix columns is incorrect: note
>>> that R has lazy evaluation, so the matrix you pass is constructed
>>> inside the timing for the rows and is already constructed by the time
>>> the columns are timed. Hence you want to use:
>>>
>>>   M <- matrix(rnorm(C * R), nrow = R)
>>>   D <- as.data.frame(matrix(rnorm(C * R), nrow = R))
>>>   example(M)
>>>   example(D)
>>>
>>> Further on, you are correct in your statement that data.frame indexing
>>> is much slower, but if you can store your data in matrix form, just go
>>> on as it is.
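>>>
>>> For example, a quick check before converting (a sketch; D is your
>>> data frame):
>>>
>>>    if (all(vapply(D, is.numeric, logical(1)))) {
>>>      M <- data.matrix(D)   # lossless when all columns are numeric;
>>>                            # indexing into M is then fast
>>>    }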
>>>
>>> I doubt anybody is really going to perform the indexing operation you
>>> cited inside a loop. And with a data.frame, I can live with vectorized
>>> replacements:
>>>
>>>> system.time(D[, 20] <- sqrt(abs(D[, 20])) + rnorm(1000))
>>>    user  system elapsed
>>>    0.01    0.00    0.01
>>>
>>>> system.time(D[20, ] <- sqrt(abs(D[20, ])) + rnorm(1000))
>>>    user  system elapsed
>>>    0.51    0.00    0.52
>>>
>>> OK, it would be nice to do that faster, but this is not easy. I think
>>> R Core is happy to see contributions that make it faster without
>>> breaking existing features.
>>>
>>>
>>>
>>> Best wishes,
>>> Uwe
>>>
>>>
>>>
>>>
>>> On 02.07.2011 20:35, ivo welch wrote:
>>>
>>>> This email is intended for R users who are not that familiar with R
>>>> internals and are searching google for how to speed up R.
>>>>
>>>> Despite a common misperception, R is not slow when it comes to
>>>> iterative access per se.  R is fast when it comes to matrices.  R is
>>>> very slow when it comes to iterative access into data frames.  Such
>>>> access occurs whenever a user writes "data$varname[index]", which is
>>>> a very common operation.  To illustrate, run the following program:
>>>>
>>>> R <- 1000; C <- 1000
>>>>
>>>> example <- function(m) {
>>>>   cat("rows: ")
>>>>   cat(system.time(for (r in 1:R) m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)), "\n")
>>>>   cat("columns: ")
>>>>   cat(system.time(for (c in 1:C) m[20, c] <- sqrt(abs(m[20, c])) + rnorm(1)), "\n")
>>>>   if (is.data.frame(m)) {
>>>>     cat("df: columns as names: ")
>>>>     cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
>>>>   }
>>>> }
>>>>
>>>> cat("\n**** Now as matrix\n")
>>>> example(matrix(rnorm(C * R), nrow = R))
>>>>
>>>> cat("\n**** Now as data frame\n")
>>>> example(as.data.frame(matrix(rnorm(C * R), nrow = R)))
>>>>
>>>>
>>>> The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
>>>> with ample RAM:
>>>>
>>>> matrix, columns: 0.01s
>>>> matrix, rows: 0.175s
>>>> data frame, columns: 53s
>>>> data frame, rows: 56s
>>>> data frame, names: 58s
>>>>
>>>> Data frame column access is about 5,000 times slower than matrix
>>>> column access, and about 300 times slower than matrix row access.
>>>> R's data frame operational speed is an amazing 40 data accesses per
>>>> second (each iteration does one read and one write, so roughly 2,000
>>>> accesses in 53 seconds).  I have not seen access numbers this low for
>>>> decades.
>>>>
>>>>
>>>> How to avoid it?  Not easy.  One way is to create multiple matrices
>>>> and group them as an object; of course, this loses a lot of the
>>>> features of R.  Another way is to copy all data used in the
>>>> calculations out of the data frame into a matrix, do the operations
>>>> there, and then copy the results back.  not ideal, either; a sketch
>>>> follows below.
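>>>>
>>>> A minimal sketch of the copy-out/copy-back pattern (the data frame D
>>>> and the column names are illustrative):
>>>>
>>>>    m <- as.matrix(D[, c("V19", "V20")])   # copy needed columns out
>>>>    for (r in 1:nrow(m))                   # loop over the fast matrix
>>>>      m[r, "V20"] <- sqrt(abs(m[r, "V20"])) + rnorm(1)
>>>>    D[, c("V19", "V20")] <- m              # copy the results back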
>>>>
>>>> In my opinion, this is an R design flaw.  Data frames are the
>>>> fundamental unit of much statistical analysis and should be fast.  I
>>>> believe R lacks any fast indexing into data frames.  Turning on such
>>>> indexing should at least be an optional feature.
>>>>
>>>>
>>>> I hope this post helps others.
>>>>
>>>> /iaw
>>>>
>>>> ----
>>>> Ivo Welch (ivo.welch at gmail.com)
>>>> http://www.ivo-welch.info/
>>>>
>>>>
>>>
>>
> 
> 


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University


