[R] Possible Improvement to sapply
William Dunlap
wdunlap at tibco.com
Tue Mar 13 17:10:55 CET 2018
Wouldn't that change how simplify='array' is handled?
> str(sapply(1:3, function(x)diag(x,5,2), simplify="array"))
int [1:5, 1:2, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
> str(sapply(1:3, function(x)diag(x,5,2), simplify=TRUE))
int [1:10, 1:3] 1 0 0 0 0 0 1 0 0 0 ...
> str(sapply(1:3, function(x)diag(x,5,2), simplify=FALSE))
List of 3
$ : int [1:5, 1:2] 1 0 0 0 0 0 1 0 0 0
$ : int [1:5, 1:2] 2 0 0 0 0 0 2 0 0 0
$ : int [1:5, 1:2] 3 0 0 0 0 0 3 0 0 0
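As an editor's sketch (not part of the original exchange) of why the character value "array" passes the current guard but would break the proposed one:

```r
# The current guard in sapply(),
#   if (!identical(simplify, FALSE) && length(answer)),
# treats any value other than FALSE -- including the string "array" --
# as a request to simplify:
stopifnot(!identical("array", FALSE))

# The proposed guard, if (simplify && length(answer)), requires a logical
# value. A character string such as "array" cannot supply one
# (as.logical("array") is NA), so the condition errors out:
stopifnot(is.na(as.logical("array")))
res <- tryCatch(if ("array" && TRUE) "simplified",
                error = function(e) "error")
stopifnot(identical(res, "error"))
```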
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Mar 13, 2018 at 6:23 AM, Doran, Harold <HDoran at air.org> wrote:
> While working with sapply, the documentation states that the simplify
> argument will yield a vector, matrix, etc. "when possible". I was curious
> how the code actually defines "when possible" and found this within the
> function:
>
> if (!identical(simplify, FALSE) && length(answer))
>
> This seems superfluous to me, in particular this part:
>
> !identical(simplify, FALSE)
>
> The preceding code could be reduced to
>
> if (simplify && length(answer))
>
> and it would not need to call identical() to decide whether to simplify,
> since that is already known from the user's simplify = TRUE or FALSE
> input. I *think* the extra call to identical is just unnecessary
> overhead in this instance.
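For strictly logical values of simplify the two guards do agree, which is presumably what motivates the proposal; a brief editor's sketch:

```r
# For simplify = TRUE or FALSE, the original condition
# !identical(simplify, FALSE) and the proposed condition evaluate
# identically (assuming a non-empty answer):
for (s in c(TRUE, FALSE)) {
    stopifnot(identical(!identical(s, FALSE), s && TRUE))
}
```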
>
> Take, for example, the following toy code, benchmark results, and a
> small modification to sapply:
>
> myList <- list(a = rnorm(100), b = rnorm(100))
>
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
>
> library(microbenchmark)
>
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
>     FUN <- match.fun(FUN)
>     answer <- lapply(X = X, FUN = FUN, ...)
>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
>         names(answer) <- X
>     if (simplify && length(answer))
>         simplify2array(answer, higher = (simplify == "array"))
>     else answer
> }
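An editor's check of the modified version (restating mySapply so the snippet is self-contained): it matches base sapply for logical simplify values, but simplify = "array" fails in the modified version where base sapply returns a 3-D array.

```r
mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
    FUN <- match.fun(FUN)
    answer <- lapply(X = X, FUN = FUN, ...)
    if (USE.NAMES && is.character(X) && is.null(names(answer)))
        names(answer) <- X
    if (simplify && length(answer))
        simplify2array(answer, higher = (simplify == "array"))
    else answer
}

myList <- list(a = rnorm(100), b = rnorm(100))

# Agreement with base sapply for logical values of simplify:
stopifnot(identical(sapply(myList, length), mySapply(myList, length)))
stopifnot(identical(sapply(myList, length, simplify = FALSE),
                    mySapply(myList, length, simplify = FALSE)))

# But simplify = "array" errors in the modified version, where base
# sapply returns a 5 x 2 x 3 array:
arr <- sapply(1:3, function(x) diag(x, 5, 2), simplify = "array")
stopifnot(identical(dim(arr), c(5L, 2L, 3L)))
bad <- tryCatch(mySapply(1:3, function(x) diag(x, 5, 2), simplify = "array"),
                error = function(e) "error")
stopifnot(identical(bad, "error"))
```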
>
>
> > microbenchmark(sapply(myList, length), times = 10000L)
> Unit: microseconds
> expr min lq mean median uq max neval
> sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000
> > microbenchmark(mySapply(myList, length), times = 10000L)
> Unit: microseconds
>                      expr    min     lq     mean median     uq      max neval
>  mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000
>
> My benchmark timings show an improvement from only that small change,
> and it is admittedly nominal. In my actual work, however, the sapply
> function is called millions of times, so this additional overhead
> propagates into noticeable extra computing time.
>
> I have done some limited testing on various real data to verify that
> both variants of sapply (base R and my modified version) yield identical
> objects when simplify is either TRUE or FALSE.
>
> Perhaps someone else sees a counterexample where my proposed fix causes
> sapply not to behave as expected.
>
> Harold
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>