[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
Hervé Pagès
hpages at fredhutch.org
Thu Jun 8 06:38:10 CEST 2017
Hi Martin,
On 06/07/2017 03:54 AM, Martin Maechler wrote:
>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>> on Tue, 6 Jun 2017 09:45:44 +0200 writes:
>
>>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>>> on Fri, 2 Jun 2017 04:05:15 -0700 writes:
>
> >> Hi, I have a long numeric vector 'xx' and I want to use
> >> sum() to count the number of elements that satisfy some
> >> criteria like non-zero values or values lower than a
> >> certain threshold etc...
>
> >> The problem is: sum() returns an NA (with a warning) if
> >> the count is greater than 2^31. For example:
>
> >>> xx <- runif(3e9)
> >>> sum(xx < 0.9)
> >> [1] NA
> >> Warning message:
> >> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>
> >> This already takes a long time and doing
> >> sum(as.numeric(.)) would take even longer and require
> >> allocation of 24 GB of memory just to store an
> >> intermediate numeric vector made of 0s and 1s. Plus,
> >> having to do sum(as.numeric(.)) every time I need to
> >> count things is not convenient and is easy to forget.
>
> >> It seems that sum() on a logical vector could be modified
> >> to return the count as a double when it cannot be
> >> represented as an integer. Note that length() already
> >> does this so that wouldn't create a precedent. Also and
> >> FWIW prod() avoids the problem by always returning a
> >> double, whatever the type of the input is (except on a
> >> complex vector).
>
> >> I can provide a patch if this change sounds reasonable.
>
> > This sounds very reasonable, thank you Hervé, for the
> > report, and even more for a (small) patch.
>
> I was made aware of the fact that R very often treats logical and
> integer identically in the C code, and in general we
> even mention that logicals are treated as 0/1/NA integers in
> arithmetic.
>
> For the present case that would mean that we should also
> safeguard against *integer* overflow in sum(.) and that is
> not something we have done / wanted to do in the past... Speed
> being one reason.
>
> So this ends up being more delicate than I had thought at first,
> because changing sum(<logical>) only would mean that
>
> sum(LOGI) and
> sum(as.integer(LOGI))
>
> would start to differ for a logical vector LOGI.
>
> So, for now this is something that must be approached carefully,
> and the R Core team may want to discuss it "in private" first.
>
> I'm sorry for having raised possibly unrealistic expectations.
No worries. Thanks for taking my proposal into consideration.
Note that the isum() function in src/main/summary.c already uses
a 64-bit accumulator to accommodate intermediate sums > INT_MAX.
So it should be easy to modify the function so that it only overflows
for much bigger final sums, without altering performance. R_XLEN_T_MAX
seems like the natural threshold.
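To illustrate the idea, here is a minimal sketch of accumulating in 64 bits
and only then deciding whether the total fits in an int. This is not the
actual code in summary.c: the names sum_int_wide(), fits_in_int() and
NA_INT_SKETCH are hypothetical, NA handling is simplified (NA elements are
skipped, as with na.rm = TRUE), and the threshold check against R_XLEN_T_MAX
is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for R's NA_INTEGER (which is INT_MIN), defined locally so the
   sketch compiles without R's headers. */
#define NA_INT_SKETCH INT_MIN

/* Hypothetical helper, not R's isum(): sums n ints (0/1/NA logicals or
   ordinary integers) in a 64-bit accumulator, skipping NA elements.
   Returns the exact total; the caller decides how to report it. */
static int64_t sum_int_wide(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] != NA_INT_SKETCH)
            s += x[i];
    return s;
}

/* Returns 1 if the total can be reported as a non-NA 32-bit int,
   0 if it would have to be reported some other way (e.g. as a double). */
static int fits_in_int(int64_t s)
{
    return s > INT_MIN && s <= INT_MAX;
}
```

A caller could return an integer whenever fits_in_int() holds and fall back
to a double otherwise. Whether sum() should do that is the policy question
discussed above.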
Cheers,
H.
> Martin
>
> > Martin
>
> >> Cheers, H.
>
> >> --
> >> Hervé Pagès
>
> >> Program in Computational Biology Division of Public
> >> Health Sciences Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA
> >> 98109-1024
>
> >> E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax:
> >> (206) 667-1319
>
> >> ______________________________________________
> >> R-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319