[BioC] Integer overflow when summing an 'integer' Rle

Nicolas Delhomme delhomme at embl.de
Wed Sep 5 12:59:10 CEST 2012


Great!

Thanks,

Nico
---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 4 Sep 2012, at 22:16, Valerie Obenchain wrote:

> Hi Nico,
> 
> The following fixes have been applied to IRanges 1.15.43
> 
> (1) The 'Integer overflow' warning thrown by sum() on an integer-Rle is now more appropriate:
> 
> library(IRanges)
> x <- Rle(values=as.integer(c(1, 2^31 -1, 1)))
> > sum(x)
> [1] NA
> Warning message:
> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
>   Integer overflow - use runValue(.) <- as.numeric(runValue(.))
> 
> (2) Integers are coerced to numeric when calling mean() on an integer-Rle:
> 
> > mean(x)
> [1] 715827883
> 
> Valerie
> 
> 
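The two behaviours described above can be reproduced without IRanges. What follows is a minimal base-R sketch, assuming a plain integer vector stands in for the run values of the Rle (every run has length 1 in the example); it illustrates the arithmetic only and is not the IRanges implementation.

```r
# Plain integer vector standing in for the run values of the Rle above
# (every run has length 1 in that example).
v <- as.integer(c(1, 2^31 - 1, 1))

# Integer summation: the true total, 2147483649, exceeds
# .Machine$integer.max (2147483647), so base R returns NA with a warning.
int_total <- suppressWarnings(sum(v))

# Summation after coercion to double gives the exact total.
dbl_total <- sum(as.numeric(v))

# Mean computed on doubles, mirroring the coercion described in fix (2).
dbl_mean <- dbl_total / length(v)

print(int_total)
print(dbl_total)
print(dbl_mean)
```

The mean agrees with the value reported in (2) because 2147483649 is exactly divisible by 3.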
> 
> ## Paste of original correspondence between Nico and Herve 
> 
> [BioC] Integer overflow when summing an 'integer' Rle
> Nicolas Delhomme delhomme at embl.de
> Tue Feb 14 17:35:48 CET 2012
> 
> Hi Hervé,
> 
> Happy New Year! Well, we're already in mid-February, but most of the year is still ahead of us ;-)
> 
> On 10 Feb 2012, at 19:30, Hervé Pagès wrote:
> 
> > Hi Nico,
> > 
> > On 02/10/2012 08:04 AM, Nicolas Delhomme wrote:
> >> Hi all,
> >> 
> >> While calculating some statistics for an RNA-seq experiment I stumbled upon the following problem. Applying the IRanges coverage function to my IRanges, I get back an integer Rle object. However, trying to get the mean or sum of that Rle object results in an integer overflow. The following example illustrates that overflow.
> >> 
> >> library(IRanges)
> >> rC<- Rle(values=as.integer(c(1,(2^31)-1,1)))
> >> sum(rC)
> >> mean(rC)
> >> 
> >> Both result in an integer overflow.
> >> 
> >> [1] NA
> >> Warning message:
> >> In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> >>   Integer overflow - use sum(as.numeric(.))
> >> 
> >> A solution to that is to do the following:
> >> 
> >> sum(as.numeric(runLength(rC) * runValue(rC)))
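One caveat with the form above, offered as a base-R sketch rather than a statement about IRanges internals: the integer product is formed before as.numeric() is applied, so a single value-times-length product larger than .Machine$integer.max would already become NA. The vectors run_values and run_lengths below are hypothetical stand-ins for runValue(rC) and runLength(rC).

```r
# Hypothetical run values and run lengths, chosen so that one
# value * length product exceeds .Machine$integer.max.
run_values  <- as.integer(c(5, 2000000000, 7))
run_lengths <- as.integer(c(1, 2, 1))

# Multiply first, convert afterwards: the integer product overflows
# to NA before as.numeric() is applied, so the sum is NA as well.
late <- suppressWarnings(sum(as.numeric(run_lengths * run_values)))

# Convert first: the products and the sum are computed in double.
early <- sum(as.numeric(run_values) * run_lengths)

print(late)
print(early)
```

In the original example every run has length 1, so the product itself does not overflow and both orderings give the same result there.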
> > 
> > Another solution is to convert the 'integer' Rle into a 'numeric' Rle
> > before doing sum(). Unfortunately, since we don't have separate
> > classes for those (like for example an IntegerRle and a DoubleRle
> > class) it cannot be done using direct coercion i.e. with something
> > like:
> > 
> >  as(rC, "DoubleRle")
> > 
> > (Maybe we should have individual Rle subclasses for 'integer' Rle,
> > 'numeric' Rle, 'logical' Rle, 'character' Rle, 'factor' Rle etc...)
> > 
> 
> That could be useful. A few times I have had to do quite a few conversions to go back and forth between different Rle "kinds". Having subclasses would be great.
> 
> > So for now, this conversion must be done with:
> > 
> > > class(runValue(rC)) <- "double"
> > > rC
> > 'numeric' Rle of length 3 with 3 runs
> >  Lengths:          1          1          1
> >  Values :          1 2147483647          1
> > 
> > This works fine with an Rle, but not so much with an RleList where
> > one needs to do some ugly contortions in order to succeed.
> 
> Well, I ended up doing that in an lapply and it works just fine. Not the most efficient memory-wise, though.
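The lapply approach mentioned above might look roughly like the following. This is a sketch under stated assumptions: a named list of base-R rle() objects stands in for an RleList, and the element names chr1 and chr2 are illustrative only. It does not use the IRanges API.

```r
# Base-R stand-in for an RleList: a named list of rle() objects.
# The names chr1 and chr2 are illustrative only.
cov_list <- list(
  chr1 = rle(as.integer(c(1, 1, 2147483647))),
  chr2 = rle(as.integer(c(3, 3, 3, 4)))
)

# Convert the run values of every element to double, one element at a time.
cov_dbl <- lapply(cov_list, function(r) {
  storage.mode(r$values) <- "double"
  r
})

# Per-element sums computed from the run-length representation.
totals <- vapply(cov_dbl, function(r) sum(r$values * r$lengths), numeric(1))

print(totals)
```

Each element is copied while it is converted, which is consistent with the remark that this route is not the most memory-efficient.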
> 
> > 
> > Alternatively to having individual Rle subclasses maybe we could have
> > an accessor e.g. rleValueType(), with a getter and a setter, so we could
> > do:
> > 
> > > rleValueType(rC)
> > [1] "integer"
> > > rleValueType(rC) <- "double"
> > 
> > and that would work on Rle and RleList objects.
> > 
> 
> That would indeed be very useful and probably easier to implement.
> 
> > Anyway, even though I think having an easy/unified way for changing
> > the type of the values in Rle/RleList objects is important, maybe
> > I'm going slightly off-topic.
> > 
> > What we should definitely do now is replace this warning:
> > 
> >  Warning message:
> >  In sum(runValue(x) * runLength(x), ..., na.rm = na.rm) :
> >     Integer overflow - use sum(as.numeric(.))
> > 
> > by a more appropriate one (doing as.numeric() on an Rle is not a good
> > idea).
> > 
> 
> Indeed.
> 
> 
> >> 
> >> but IMO it should be handled in the Rle-level code; i.e. an integer Rle can clearly have a sum, a mean, etc. whose calculation involves values outside the integer range.
> > 
> > I agree for mean() so I'll fix that.
> > 
> > But for sum()... "calculating values outside the integer range",
> > even if the result of this calculation itself is not in the
> > integer range? base::sum() will return NA if the result is not in
> > the integer range and I think that's the right thing to do.
> > I don't like the idea of sum() returning a double when the input
> > is integer.
> > 
> 
> I'm on the same page here. Consistency (especially for R) is crucial. Under these conditions, having a meaningful warning would indeed be the best.
> 
> Thanks for the detailed answer and for the slightly off-topic "diversion".
> 
> Cheers,
> 
> Nico
> 
> > Cheers,
> > H.
> > 
> >> Is there anything that speaks against having these functions internally convert the integer values to numeric before calculating the sum or mean?
> >> 
> >> Looking forward to hearing your thoughts on this,
> >> 
> >> Cheers,
> >> 
> >> Nico
> >> 
> >> sessionInfo()
> >> R Under development (unstable) (2012-02-07 r58290)
> >> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >> 
> >> locale:
> >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
> >> 
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >> 
> >> other attached packages:
> >> [1] IRanges_1.13.24    BiocGenerics_0.1.4
> >> 
> >> loaded via a namespace (and not attached):
> >> [1] tools_2.15.0
> >> 
> >> 
> >> 
> >> ---------------------------------------------------------------
> >> Nicolas Delhomme
> >> 
> >> Genome Biology Computational Support
> >> 
> >> European Molecular Biology Laboratory
> >> 
> >> Tel: +49 6221 387 8310
> >> Email: nicolas.delhomme at embl.de
> >> Meyerhofstrasse 1 - Postfach 10.2209
> >> 69102 Heidelberg, Germany
> >> 
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > 
> > 
> > -- 
> > Hervé Pagès
> > 
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> > 
> > E-mail: hpages at fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> 
> More information about the Bioconductor mailing list
> 
> 
> 