[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler maechler at stat.math.ethz.ch
Wed Jun 7 12:54:19 CEST 2017


>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 6 Jun 2017 09:45:44 +0200 writes:

>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>>     on Fri, 2 Jun 2017 04:05:15 -0700 writes:

    >> Hi, I have a long numeric vector 'xx' and I want to use
    >> sum() to count the number of elements that satisfy some
    >> criteria like non-zero values or values lower than a
    >> certain threshold etc...

    >> The problem is: sum() returns an NA (with a warning) if
    >> the count exceeds 2^31 - 1, the largest integer. For example:

    >> > xx <- runif(3e9)
    >> > sum(xx < 0.9)
    >> [1] NA
    >> Warning message:
    >> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

    >> This already takes a long time and doing
    >> sum(as.numeric(.)) would take even longer and require
    >> allocation of 24 GB of memory just to store an
    >> intermediate numeric vector made of 0s and 1s. Plus,
    >> having to do sum(as.numeric(.)) every time I need to
    >> count things is not convenient and is easy to forget.

    >> It seems that sum() on a logical vector could be modified
    >> to return the count as a double when it cannot be
    >> represented as an integer.  Note that length() already
    >> does this so that wouldn't create a precedent. Also and
    >> FWIW prod() avoids the problem by always returning a
    >> double, whatever the type of the input is (except on a
    >> complex vector).

    >> I can provide a patch if this change sounds reasonable.

    > This sounds very reasonable, thank you Hervé, for the
    > report, and even more for a (small) patch.

I have since been made aware that R very often treats logical and
integer identically in the C code, and in general we even mention
that logicals are treated as 0/1/NA integers in arithmetic.
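
To illustrate the 0/1/NA treatment, here is a minimal sketch; the
vector name `lg` is only an illustration and the comments describe
what I would expect for an ordinary-sized vector:

```r
# Illustrative vector (the name `lg` is not from the thread).
lg <- c(TRUE, FALSE, NA, TRUE)

# In arithmetic, TRUE counts as 1L, FALSE as 0L, and NA stays NA.
typeof(TRUE + TRUE)            # "integer"
as.integer(lg)                 # 1 0 NA 1
lg + 1L                        # 2 1 NA 2

# For an ordinary-sized logical vector, sum() gives an integer count.
n_true <- sum(lg, na.rm = TRUE)
typeof(n_true)                 # "integer"
n_true                         # 2
```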

For the present case that would mean that we should also
safeguard against *integer* overflow in sum(.), and that is
not something we have done / wanted to do in the past, speed
being one reason.
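
A small sketch of the integer case, which needs no long vector at all.
The names `big`, `res_int` and `res_dbl` are only illustrative, and the
NA result noted in the comment is the behaviour reported in this
thread, so it may differ in other R versions:

```r
# Two integers whose true sum is 2^31, one more than .Machine$integer.max.
big <- c(.Machine$integer.max, 1L)

# Integer summation: the behaviour reported in this thread is NA plus an
# "integer overflow" warning (suppressed here to keep the output quiet).
res_int <- suppressWarnings(sum(big))

# The workaround named in the warning: sum in double precision instead.
res_dbl <- sum(as.numeric(big))
res_dbl == 2^31                # TRUE
```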

So this ends up being more delicate than I had thought at first,
because changing only  sum(<logical>)  would mean that

  sum(LOGI)              and
  sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.
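
As a user-level sketch only (this is not the proposed patch, and
`sum_chunked` and `chunk_size` are hypothetical names, not part of R),
one way to get the same answer for both types is to sum chunk by chunk
into a double, so that only one chunk is ever converted at a time:

```r
# Hypothetical helper: sum a logical or integer vector chunk by chunk,
# accumulating in a double so the total cannot overflow the integer range.
sum_chunked <- function(x, chunk_size = 1e6) {
  stopifnot(is.logical(x) || is.integer(x))
  n <- length(x)
  total <- 0                     # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # as.numeric() copies only the current chunk, not the whole vector
    total <- total + sum(as.numeric(x[start:end]))
    start <- end + 1
  }
  total
}

LOGI <- c(TRUE, FALSE, TRUE, TRUE)
identical(sum(LOGI), sum(as.integer(LOGI)))                  # TRUE
identical(sum_chunked(LOGI), sum_chunked(as.integer(LOGI)))  # TRUE
sum_chunked(c(.Machine$integer.max, 1L)) == 2^31             # TRUE
```

The trade-off is an R-level loop, so this is slower than a single
sum() call; it only avoids the large intermediate double vector.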

So, for now this is something that must be approached carefully,
and the R Core team may want to discuss it "in private" first.

I'm sorry for having raised possibly unrealistic expectations.
Martin

    > Martin

    >> Cheers, H.

    >> -- 
    >> Hervé Pagès

    >> Program in Computational Biology Division of Public
    >> Health Sciences Fred Hutchinson Cancer Research Center
    >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
    >> 98109-1024

    >> E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax:
    >> (206) 667-1319

    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel