[BioC] Integer overflow when summing an 'integer' Rle
Nicolas Delhomme
delhomme at embl.de
Wed Sep 5 12:59:10 CEST 2012
Great!
Thanks,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 4 Sep 2012, at 22:16, Valerie Obenchain wrote:
> Hi Nico,
>
> The following fixes have been applied to IRanges 1.15.43
>
> (1) The 'Integer overflow' warning thrown by sum() on an integer-Rle is now more appropriate:
>
> library(IRanges)
> x <- Rle(values=as.integer(c(1, 2^31 -1, 1)))
> > sum(x)
> [1] NA
> Warning message:
> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> Integer overflow - use runValue(.) <- as.numeric(runValue(.))
>
> (2) Integers are coerced to numeric when calling mean() on an integer-Rle:
>
> > mean(x)
> [1] 715827883
>
> Valerie
>
>
>
> ## Paste of original correspondence between Nico and Herve
>
> [BioC] Integer overflow when summing an 'integer' Rle
> Nicolas Delhomme delhomme at embl.de
> Tue Feb 14 17:35:48 CET 2012
>
> Hi Hervé,
>
> Happy New Year! Well, we're already mid-Feb, but most of the year is still ahead of us ;-)
>
> On 10 Feb 2012, at 19:30, Hervé Pagès wrote:
>
> > Hi Nico,
> >
> > On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
> >> Hi all,
> >>
> >> While calculating some statistics for an RNA-seq experiment I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
> >>
> >> library(IRanges)
> >> rC <- Rle(values=as.integer(c(1, 2^31 - 1, 1)))
> >> sum(rC)
> >> mean(rC)
> >>
> >> Both result in an integer overflow.
> >>
> >> [1] NA
> >> Warning message:
> >> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> >> Integer overflow - use sum(as.numeric(.))
> >>
> >> The solution to that is to do the following:
> >>
> >> sum(as.numeric(runLength(rC) * runValue(rC)))
> >
> > Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> > before doing sum(). Unfortunately, since we don't have separate
> > classes for those (like for example an IntegerRle and a DoubleRle
> > class) it cannot be done using direct coercion i.e. with something
> > like:
> >
> > as(rC, "DoubleRle")
> >
> > (Maybe we should have individual Rle subclasses for 'integer' Rle,
> > 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)
> >
>
> That could be useful. A few times I have had to do quite a few conversions to go back and forth between different Rle "kinds". Having subclasses would be great.
>
> > So for now, this conversion must be done with:
> >
> > > class(runValue(rC)) <- "double"
> > > rC
> > 'numeric' Rle of length 3 with 3 runs
> > Lengths: 1 1 1
> > Values : 1 2147483647 1
> >
> > This works fine with an Rle, but not so much with an RleList where
> > one needs to do some ugly contortions in order to succeed.
>
> Well, I ended up doing that in an lapply and it works just fine. Not the most memory-efficient approach, though.
>
> >
> > Alternatively to having individual Rle subclasses maybe we could have
> > an accessor e.g. rleValueType(), with getter and setters, so we could
> > do:
> >
> > > rleValueType(rC)
> > [1] "integer"
> > > rleValueType(rC) <- "double"
> >
> > and that would work on Rle and RleList objects.
> >
>
> That would indeed be very useful and probably easier to implement.
>
> > Anyway, even though I think having an easy/unified way for changing
> > the type of the values in Rle/RleList objects is important, maybe
> > I'm going slightly off-topic.
> >
> > What we should definitely do now is replace this warning:
> >
> > Warning message:
> > In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> > Integer overflow - use sum(as.numeric(.))
> >
> > by a more appropriate one (doing as.numeric() on an Rle is not a good
> > idea).
> >
>
> Indeed.
>
>
> >>
> >> but IMO it should be handled in the Rle-level code; i.e. an integer Rle can clearly have sum, mean, etc. results that involve calculating values outside the integer range.
> >
> > I agree for mean() so I'll fix that.
> >
> > But for sum()... "calculating values outside the integer range",
> > even if the result of this calculation itself is not in the
> > integer range? base::sum() will return NA if the result is not in
> > the integer range and I think that's the right thing to do.
> > I don't like the idea of sum() returning a double when the input
> > is integer.
> >
>
> I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best.
>
> Thanks for the detailed answer and for the slightly off-topic "diversion".
>
> Cheers,
>
> Nico
>
> > Cheers,
> > H.
> >
> >> Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?
> >>
> >> Looking forward to hearing your thoughts on this,
> >>
> >> Cheers,
> >>
> >> Nico
> >>
> >> sessionInfo()
> >> R Under development (unstable) (2012-02-07 r58290)
> >> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>
> >> locale:
> >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
> >>
> >> attached base packages:
> >> [1] stats graphics grDevices utils datasets methods base
> >>
> >> other attached packages:
> >> [1] IRanges_1.13.24 BiocGenerics_0.1.4
> >>
> >> loaded via a namespace (and not attached):
> >> [1] tools_2.15.0
> >>
> >
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
>
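
For anyone reading this thread in the archive, here is a minimal base-R sketch of the overflow discussed above. It does not use IRanges: the plain vectors run_values and run_lengths are hypothetical stand-ins for runValue(x) and runLength(x) of the example Rle, and the other variable names are likewise only illustrative. It should show why the integer sum comes back as NA while the same computation on doubles does not.

```r
# Base-R stand-ins (assumed names) for runValue(x) and runLength(x)
# of the example Rle from the thread.
run_values  <- as.integer(c(1, 2^31 - 1, 1))
run_lengths <- c(1L, 1L, 1L)

# Integer arithmetic: the total exceeds .Machine$integer.max, so sum()
# returns NA (with an integer-overflow warning, suppressed here).
int_total <- suppressWarnings(sum(run_values * run_lengths))

# Converting the values to double before summing avoids the overflow.
dbl_total <- sum(as.numeric(run_values) * run_lengths)

# Length-weighted mean computed in double precision.
dbl_mean <- dbl_total / sum(run_lengths)

print(is.na(int_total))
print(dbl_total)
print(dbl_mean)
```

On an actual Rle, the equivalent conversion is the one named in the new warning message, runValue(x) <- as.numeric(runValue(x)), after which sum(x) works on doubles.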