[R] How to make the "apply" faster

William Dunlap wdunlap at tibco.com
Mon Jul 11 22:32:09 CEST 2016


If you use Rcpp::Rcpp.package.skeleton() to make a package out
of the attached C++ code you can speed up the run counting quite
a bit - to 1.4 s. from 48 s. for the 101 x 107 x 17 x 103 example.

The package you make will define an R function called CountColumnRuns
that computes the number of runs of TRUE of a given minimum length
for each column in a matrix.  You can adapt it to your 4-dimensional array
with

f5 <- function (condition, spellLength = 2)
{
    stopifnot(is.logical(condition), !anyNA(condition))
    d <- dim(condition)
    dn <- dimnames(condition)
    tmp <- array(aperm(condition, c(3, 1, 2, 4)), c(d[3], prod(d[-3])))
    runCounts <- RcppCountRuns::CountColumnRuns(tmp, spellLength)
    array(runCounts, dim = d[-3], dimnames = dn[-3])
}

and call it as

f5( x > 1, spellLength=2)

to get the counts of all runs of length at least 2 in the 3rd dimension of
your 4-d array.



#include <Rcpp.h>

int CountRunsRaw(const int *x, int n, int minRun)
{
    int count(0);
    int runLength(0);
    const int* lastX(x + n);
    for( ; x != lastX ; x++)
    {
        if (*x)
        {
            runLength++;
        }
        else
        {
            if (runLength >= minRun)
            {
                count++;
            }
            runLength = 0;
        }
    }
    if (runLength >= minRun)
    {
        count++;
    }
    return count;
}

// [[Rcpp::export]]
int CountRuns(const Rcpp::LogicalVector& x, int minRun)
{
    return CountRunsRaw(&x[0], x.size(), minRun);
}
// [[Rcpp::export]]
Rcpp::IntegerVector CountColumnRuns(const Rcpp::LogicalMatrix& x, int
minRun)
{
    Rcpp::IntegerVector out(x.ncol());
    for(int i = 0 ; i < x.ncol() ; i++)
    {
        out[i] = CountRuns(x(Rcpp::_, i), minRun);
    }
    return out;
}


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Jul 11, 2016 at 9:42 AM, William Dunlap <wdunlap at tibco.com> wrote:

> How fast is fast enough and what size and shape is your dataset
> (show the output of str(yourData))?  You will get the fastest execution
> time by using C or C++ or Fortran, but you will want to parameterize
> the problem well enough that you can amortize the time it takes to
> write the code over many problems.
>
> Peter Langflder's suggested you use aperm followed by colSums
> for your earlier problem.   For this one you can use aperm followed
> by filter (to identify the runs of a given minimum length, column by
> column) and then use colSums (to count the number of runs filter
> identifies in each column).  E.g.,
>
>   f4 <- function (condition, spellLength = 2)
>   {
>       # obscure way to count runs of length>spellLength
>       # in 3'rd dimension of logical array 'condition'.
>       stopifnot(is.logical(condition), !anyNA(condition))
>       coef <- c(-1, rep(1, spellLength))
>       d <- dim(condition)
>       dn <- dimnames(condition)
>       tmp <- array(aperm(condition * 2 - 1, c(3, 1, 2, 4)), c(d[3],
>           prod(d[-3])))
>       fTmp <- filter(rbind(tmp, -1), coef, sides = 1)
>      sfTmp <- colSums(fTmp == spellLength + 1, na.rm = TRUE)
>      array(sfTmp, dim = d[-3], dimnames = dn[-3])
>   }
>
> f4(x>=1) is not a great deal faster (48 s. vs. 67 s.) than
> apply(x>=1, c(1,2,4), FUN=f3) where f3 is
>   f3 <- function (condition, spellLength = 2)
>   {
>       stopifnot(is.logical(condition), !anyNA(condition))
>       y <- rle(condition)
>       sum(y$lengths[y$values] >= spellLength)
>   }
> and where x has dimensions c(101,107,17,103).
>
> The relative speed will depend on the size and shape of your dataset,
> so show the output of str(yourData).
>
>
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sun, Jul 10, 2016 at 1:38 PM, Debasish Pai Mazumder <pai1981 at gmail.com>
> wrote:
>
>> Thanks for your response. It is faster than before but still very slow.
>> Any other suggestion ?
>> -Deb
>>
>>
>> On Sun, Jul 10, 2016 at 2:13 PM, William Dunlap <wdunlap at tibco.com>
>> wrote:
>>
>>> There is no need to test that a logical equals TRUE:
>>> 'logicalVector==TRUE' is the
>>> same as just 'logicalVector'.
>>>
>>> There is no need to convert logical vectors to numeric, since rle()
>>> works on both
>>> types.
>>>
>>> There is no need to use length(subset(x, logicalVector)) to count how
>>> many elements
>>> in logicalVector are TRUE, just use sum(logicalVector).
>>>
>>> There is no need to make a variable, 'ans', then immediately return it.
>>>
>>> Hence your
>>>
>>>     b[b == TRUE] = 1
>>>     y <- rle(b)
>>>     ans <- length(subset(y$lengths[y$values == 1], y$lengths[y$values ==
>>> 1] >= 2))
>>>     return(ans)
>>>
>>> could be replaced by
>>>
>>>     y <- rle(b)
>>>     sum(y$lengths[y$values] >= 2)
>>>
>>> This gives some speedup, mainly for long vectors, but I find it more
>>> understandable.
>>> E.g., if f1 is your original function and f2 has the above replacement I
>>> get:
>>>   > d <- -sin(1:10000+sqrt(1:4))
>>>   > system.time(for(i in 1:10000)f1(d,.3))
>>>      user  system elapsed
>>>      5.19    0.00    5.19
>>>   > system.time(for(i in 1:10000)f2(d,.3))
>>>      user  system elapsed
>>>      3.65    0.00    3.65
>>>   > c(f1(d,.3), f2(d,.3))
>>>   [1] 1492 1492
>>>   > length(d)
>>>   [1] 10000
>>>
>>> If it were my function, I would also get rid of the part that deals with
>>> the threshhold
>>> and direction of the inequality and tell the user to to use f(data <=
>>> 0.3) instead of
>>> f(data, .3, "below").  I would also make the spell length an argument
>>> instead of
>>> fixing it at 2.  E.g.
>>>
>>>    > f3 <- function (condition, spellLength = 2)
>>>    {
>>>        stopifnot(is.logical(condition), !anyNA(condition))
>>>        y <- rle(condition)
>>>        sum(y$lengths[y$values] >= spellLength)
>>>    }
>>>    > f3( d >= .3 )
>>>    [1] 1492
>>>
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>> On Sun, Jul 10, 2016 at 11:58 AM, Debasish Pai Mazumder <
>>> pai1981 at gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>> Thanks for your help. It works. I have similar problem when I am
>>>> calculating number of spell.
>>>> I am also calculation spell (definition: period of two or more days
>>>> where x
>>>> exceeds 70) using similar way:
>>>>
>>>> *new = apply(x,c(1,2,4),FUN=function(y) {fun.spell.deb(y, 70)})*
>>>>
>>>> where fun.spell.deb.R:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *## Calculate spell durationfun.spell.deb <- function(data, threshold =
>>>> 1,
>>>> direction = c("above", "below")){  #coln <- grep(weather, names(data))#
>>>> var <- data[,8]  if(missing(direction)) {direction <- "above"}
>>>> if(direction=="below") {b <- (data <= threshold)} else  {b <- (data >=
>>>> threshold)}    b[b==TRUE] = 1  y <-rle(b)  ans
>>>> <-length(subset((y$lengths[y$values==1]), (y$lengths[y$values==1])>=2))
>>>> return(ans)}*
>>>>
>>>> Do you have any idea how to make the "apply" faster here?
>>>>
>>>> -Deb
>>>>
>>>>
>>>> On Sat, Jul 9, 2016 at 3:46 PM, Charles C. Berry <ccberry at ucsd.edu>
>>>> wrote:
>>>>
>>>> > On Sat, 9 Jul 2016, Debasish Pai Mazumder wrote:
>>>> >
>>>> > I have 4-dimension array x(lat,lon,time,var)
>>>> >>
>>>> >> I am using "apply" to calculate over time
>>>> >> new = apply(x,c(1,2,4),FUN=function(y) {length(which(y>=70))})
>>>> >>
>>>> >> This is very slow. Is there anyway make it faster?
>>>> >>
>>>> >
>>>> > If dim(x)[3] << prod(dim(x)[-3]),
>>>> >
>>>> > new <-  Reduce("+",lapply(1:dim(x)[3],function(z) x[,,z,]>=70))
>>>> >
>>>> > will be faster.
>>>> >
>>>> > However, if you can follow Peter Langfelder's suggestion to use
>>>> rowSums,
>>>> > that would be best. Even using rowSums(aperm(x,c(1,2,4,3)>=70,dims=3)
>>>> and
>>>> > paying the price of aperm() might be better.
>>>> >
>>>> > Chuck
>>>> >
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list