[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
Martin Maechler
maechler at stat.math.ethz.ch
Wed Jun 7 12:54:19 CEST 2017
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Tue, 6 Jun 2017 09:45:44 +0200 writes:
>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>> on Fri, 2 Jun 2017 04:05:15 -0700 writes:
>> Hi, I have a long numeric vector 'xx' and I want to use
>> sum() to count the number of elements that satisfy some
>> criteria like non-zero values or values lower than a
>> certain threshold etc...
>> The problem is: sum() returns an NA (with a warning) if
>> the count is greater than 2^31. For example:
>>   > xx <- runif(3e9)
>>   > sum(xx < 0.9)
>>   [1] NA
>>   Warning message:
>>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>> This already takes a long time and doing
>> sum(as.numeric(.)) would take even longer and require
>> allocation of 24 GB of memory just to store an
>> intermediate numeric vector made of 0s and 1s. Plus,
>> having to do sum(as.numeric(.)) every time I need to
>> count things is not convenient and is easy to forget.
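One way to avoid the large intermediate vector, sketched here under stated assumptions: count in chunks and keep the running total in a double. The names count_true_chunked, pred and chunk_size are hypothetical and not part of base R; pred is assumed to be any vectorized predicate. The example uses a small vector so that it runs quickly.

```r
# Hypothetical helper (not part of base R): count the elements of 'x'
# for which pred(x) is TRUE, accumulating in a double so that the
# total cannot leave the integer range.
count_true_chunked <- function(x, pred, chunk_size = 1e7) {
  n <- length(x)
  total <- 0       # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Each chunk holds at most chunk_size elements, so its count
    # always fits in an integer.
    total <- total + sum(pred(x[start:end]))
    start <- end + 1
  }
  total
}

xx <- runif(1e5)
print(count_true_chunked(xx, function(v) v < 0.9, chunk_size = 1e4))
```

Only one chunk of logicals exists at a time, so the extra memory is bounded by chunk_size rather than by length(x).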
>> It seems that sum() on a logical vector could be modified
>> to return the count as a double when it cannot be
>> represented as an integer. Note that length() already
>> does this so that wouldn't create a precedent. Also and
>> FWIW prod() avoids the problem by always returning a
>> double, whatever the type of the input is (except on a
>> complex vector).
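For illustration, the result types mentioned above can be checked on a small vector. This is only a sketch: it does not allocate a long vector, so the double-valued length() case (documented in ?length for vectors with more than 2^31 - 1 elements) is not exercised here.

```r
x <- c(TRUE, FALSE, TRUE)

len_type  <- typeof(length(x))  # integer for an ordinary (non-long) vector
prod_type <- typeof(prod(x))    # prod() returns a double even for logical input
sum_type  <- typeof(sum(x))     # a small count comes back as an integer

print(c(length = len_type, prod = prod_type, sum = sum_type))
```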
>> I can provide a patch if this change sounds reasonable.
> This sounds very reasonable, thank you Hervé, for the
> report, and even more for a (small) patch.
I was made aware of the fact that R very often treats logical and
integer identically in the C code, and in general we
even mention that logicals are treated as 0/1/NA integers in
arithmetic.
For the present case that would mean that we should also
safeguard against *integer* overflow in sum(.), and that is
not something we have done / wanted to do in the past... Speed
being one reason.
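To make the integer case concrete, here is a minimal sketch with two integers whose exact sum, 2^31, is one more than .Machine$integer.max. The variable names are only illustrative. Converting to double first, as the warning text quoted above suggests, gives the exact total, whereas plain integer addition overflows to NA with a warning.

```r
big <- c(.Machine$integer.max, 1L)   # exact sum is 2^31, outside the integer range

# Workaround named in the warning: sum doubles instead of integers.
total <- sum(as.numeric(big))

# Plain integer addition overflows to NA (with a warning, suppressed here).
added <- suppressWarnings(big[1] + big[2])

print(total)
print(added)
```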
So this ends up being more delicate than I had thought at first,
because changing sum(<logical>) only would mean that
sum(LOGI) and
sum(as.integer(LOGI))
would start to differ for a logical vector LOGI.
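As a small illustration of the consistency at stake (a sketch on a short vector, nowhere near the overflow range): both calls return the same integer here, because logicals are coerced to 0/1/NA integers.

```r
LOGI <- c(TRUE, NA, FALSE, TRUE)

as_int     <- as.integer(LOGI)        # TRUE -> 1L, FALSE -> 0L, NA stays NA
arith_type <- typeof(LOGI + LOGI)     # arithmetic on logicals gives integers

s_lgl <- sum(LOGI, na.rm = TRUE)
s_int <- sum(as.integer(LOGI), na.rm = TRUE)

print(identical(s_lgl, s_int))
```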
So, for now this is something that must be approached carefully,
and the R Core team may want to discuss "in private" first.
I'm sorry for having raised possibly unrealistic expectations.
Martin
> Martin
>> Cheers, H.
>> --
>> Hervé Pagès
>> Program in Computational Biology Division of Public
>> Health Sciences Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
>> 98109-1024
>> E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax:
>> (206) 667-1319
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel