[R] Possible Improvement to sapply

William Dunlap wdunlap at tibco.com
Tue Mar 13 17:10:55 CET 2018


Wouldn't that change how simplify='array' is handled?

>  str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
 int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
 int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
>  str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
List of 3
 $ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
 $ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
 $ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0
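
To spell out the concern, here is a minimal sketch comparing the two conditions quoted in this thread when simplify is a character string. The toy values of simplify and answer are assumptions made for illustration; only the two conditions themselves come from the discussion. As far as I know, `&&` does not accept a character left operand, so the proposed form should raise an error where the identical() form does not.

```r
# Toy inputs (assumed for illustration only)
simplify <- "array"
answer <- list(1, 2, 3)

# Condition currently in sapply: TRUE for any value of simplify
# that is not exactly FALSE, including the string "array".
base_cond <- !identical(simplify, FALSE) && length(answer)

# Proposed condition: `&&` needs a logical (or numeric) left operand,
# so a character value raises an error; capture its message here.
proposed_cond <- tryCatch(
    simplify && length(answer),
    error = function(e) conditionMessage(e)
)

print(base_cond)      # logical result of the existing check
print(proposed_cond)  # error message captured from the proposed check
```

With simplify = TRUE or FALSE the two conditions agree, which is presumably why testing with only those values did not reveal a difference.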


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Mar 13, 2018 at 6:23 AM, Doran, Harold <HDoran at air.org> wrote:

> While working with sapply, I noticed the documentation states that the
> simplify argument will yield a vector, matrix, etc. "when possible". I was
> curious how the code actually defines "when possible" and found this within
> the function:
>
> if (!identical(simplify, FALSE) && length(answer))
>
> This seems superfluous to me, in particular this part:
>
> !identical(simplify, FALSE)
>
> The preceding code could be reduced to
>
> if (simplify && length(answer))
>
> and it would not need to execute the call to identical in order to trigger
> the conditional execution, which is known from the user's simplify = TRUE
> or FALSE inputs. I *think* the extra call to identical is just unnecessary
> overhead in this instance.
>
> Take, for example, the following toy code, a small modification to sapply,
> and the resulting benchmark timings:
>
> myList <- list(a = rnorm(100), b = rnorm(100))
>
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
>
> library(microbenchmark)
>
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }
>
>
> > microbenchmark(sapply(myList, length), times = 10000L)
> Unit: microseconds
>                    expr    min     lq     mean median     uq    max neval
>  sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000
> > microbenchmark(mySapply(myList, length), times = 10000L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000
>
> My benchmarks show a small timing improvement from that one change alone
> (a median of 15.218 versus 15.926 microseconds). The gain looks nominal, but
> in my actual work sapply is called millions of times, so this overhead adds
> up to some additional overall computing time.
>
> I have done some limited testing on various real data to verify that both
> variants of sapply (base R and my modified version) yield identical objects
> when simplify is either TRUE or FALSE.
>
> Perhaps someone else sees a counterexample where my proposed change causes
> sapply not to behave as expected.
>
> Harold
