[R] How can I avoid nested 'for' loops or quicken the process?

Fri Dec 26 17:29:09 CET 2008

On Fri, 26 Dec 2008, Bert Gunter wrote:

> Thank you for the clarification, Brian. This is very helpful (as usual).
>
> However, I think the important point, which I misstated, is that whether it
> be for() or, e.g., lapply(), the "loop" contents must be evaluated at the
> interpreted R level, and this is where most time is typically spent. To get
> the speedup that most people hope for, the key is to avoid the loop
> altogether, i.e. to move the loop **and** the evaluations to C level via R
> programming, e.g. by using matrix operations, indexing, or built-in
> .Internal functions.
>
> Please correct me if I'm (even partially) wrong. As you know, the issue
> arises frequently.

'Typically' is not the whole story.  In a loop like

Y <- double(length(X))
for(i in seq_along(X)) Y[i] <- fun(X[i])

quite a lot of time and memory may be spent in re-allocating Y at each
step of the loop, and lapply() is able to avoid that.  E.g.

X <- runif(1e6)
system.time({
Y <- double(length(X))
for(i in seq_along(X)) Y[i] <- sin(X[i])
})

takes 5.2 secs, whereas unlist(lapply(X, sin)) takes 1.5 secs.  Of course,
calling the vectorized function directly, sin(X), takes 0.05 sec.  If you
use sapply() instead of lapply() you will lose all the gain.

This is not a typical example, but it arises often enough to make it 
worthwhile having an optimized lapply().
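
A minimal sketch for re-running the comparison above, with sapply() and the
vectorized call included.  The smaller vector length n is an assumption made
so that it finishes quickly; the timings quoted above will not be reproduced
exactly, and the gaps between the methods depend on the machine and on the
version of R.

```r
# Sketch: four ways of computing sin() over a vector.
# n is an assumed, smaller size than the 1e6 used above.
n <- 1e5
X <- runif(n)

# 1. Explicit for() loop filling a result vector element by element
t_for <- system.time({
  Y_for <- double(length(X))
  for (i in seq_along(X)) Y_for[i] <- sin(X[i])
})

# 2. lapply() returns a list, which unlist() flattens to a vector
t_lapply <- system.time(Y_lapply <- unlist(lapply(X, sin)))

# 3. sapply() calls lapply() and then simplifies the result
t_sapply <- system.time(Y_sapply <- sapply(X, sin))

# 4. The vectorized call
t_vec <- system.time(Y_vec <- sin(X))

# All four approaches should give the same values
stopifnot(isTRUE(all.equal(Y_for, Y_vec)),
          isTRUE(all.equal(Y_lapply, Y_vec)),
          isTRUE(all.equal(Y_sapply, Y_vec)))

# Elapsed times in seconds, one per method
timings <- c(for_loop   = t_for[["elapsed"]],
             lapply     = t_lapply[["elapsed"]],
             sapply     = t_sapply[["elapsed"]],
             vectorized = t_vec[["elapsed"]])
print(timings)
```

Repeating each timing several times gives a steadier picture than a single
run.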

>
> -- Bert Gunter
> Genentech
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Prof Brian Ripley
> Sent: Friday, December 26, 2008 12:44 AM
> To: Oliver Bandel
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] How can I avoid nested 'for' loops or quicken the process?
>
> On Thu, 25 Dec 2008, Oliver Bandel wrote:
>
>> Bert Gunter <gunter.berton <at> gene.com> writes:
>>
>>>
>>> FWIW:
>>>
>>> Good advice below! -- after all, the first rule of optimizing code is:
>>> Don't!
>>>
>>> For the record (yet again), the apply() family of functions (and their
>>> packaged derivatives, of course) are "merely" very carefully written for()
>>> loops: their main advantage is in code readability, not in efficiency gains,
>>> which may well be small or nonexistent. True efficiency gains require
>>> "vectorization", which essentially moves the for() loops from interpreted
>>> code to (underlying) C code (on the underlying data structures): e.g.
>>> compare rowMeans() [vectorized] with ave() or apply(..,1,mean).
>> [...]
>>
>> The apply-functions do bring speed-advantages.
>>
>> This is not only what I read about it,
>> I have used the apply-functions and really got
>> results faster.
>>
>> The reason is simple: an apply-function does
>> in C what would otherwise be done at the level of R
>> with for-loops.
>
> Not true of apply(): true of lapply() and hence sapply().  I'll leave you
> to check eapply, mapply, rapply, tapply.
>
> So the issue is what is meant by 'the apply() family of functions': people
> often mean *apply(), of which apply() is an unusual member, if one at all.
>
> [Historical note: a decade ago lapply was internally a for() loop.  I
> rewrote it in C in 2000: I also moved apply to C at the same time but it
> proved too little an advantage and was reverted.  The speed of lapply
> comes mainly from reduced memory allocation: for() is also written in C.]
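
The rowMeans() versus apply(.., 1, mean) comparison mentioned earlier in the
thread can be checked in the same way.  This is only a sketch: the matrix
dimensions are an assumption chosen for illustration, and the size of the
difference will vary by machine and version of R.

```r
# Assumed illustrative data: a 2000 x 50 matrix of uniform random numbers
m <- matrix(runif(2000 * 50), nrow = 2000)

# apply() calls mean() once per row from interpreted R code
t_apply <- system.time(r_apply <- apply(m, 1, mean))

# rowMeans() computes all the row means in compiled code
t_rowMeans <- system.time(r_rowMeans <- rowMeans(m))

# Both should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(r_apply, r_rowMeans)))

# Elapsed times in seconds
print(c(apply = t_apply[["elapsed"]], rowMeans = t_rowMeans[["elapsed"]]))
```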

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595