[BioC] IRanges coverage integer limit?
Hervé Pagès
hpages at fhcrc.org
Tue Jul 3 20:00:17 CEST 2012
On 07/03/2012 09:40 AM, Nicolas Delhomme wrote:
> Hi,
>
> I've just discovered that the IRanges coverage function "overflows" without warning. Below is an example that reproduces it:
>
> library(IRanges)
> rngs <- IRanges(c(1:100),width=100)
> coverage(rngs)
>
> 'integer' Rle of length 199 with 199 runs
> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1
>
> coverage(rngs,weight=1e9)
>
> 'integer' Rle of length 200 with 200 runs
> Lengths: 1 1 1 ... 1 1
> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0
>
> runValue(coverage(rngs,weight=1e9))
> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704
> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888
> ...
>
> Clearly, the third position, which has an unweighted coverage of 3, has a weighted coverage of 3e9, which is > 2^31 (the signed integer limit on most machines). I'm just surprised that the overflow is silently ignored.
>
> For NGS, getting a per-base coverage > 2^31 is unlikely, although I've already seen extremely high coverage for ribosomal-like proteins that was only about three orders of magnitude away (~2M X). This limits the range of weights that can be used (weights can currently only be integers), i.e. a weight of 100 would already be borderline.
>
> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps.
Thanks Nico for catching this other one. I will keep operations in the
int space for now (so an 'integer' Rle is always returned) but will make
sure a warning is issued and NAs are returned in case of overflow.
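
To illustrate the intended behaviour, here is a minimal sketch in base R only. It is not the IRanges implementation; the variable names are made up for the illustration. It shows that R's own integer arithmetic already returns NA with a warning on overflow, and that the values reported above are what 32-bit wraparound produces.

```r
# Largest value an R 'integer' can hold (2^31 - 1).
.Machine$integer.max                  # 2147483647

# A weighted coverage of 3 * 1e9 does not fit in an integer.
3e9 > .Machine$integer.max            # TRUE

# Base R integer arithmetic does not wrap: it returns NA and warns
# "NAs produced by integer overflow".
res <- suppressWarnings(1000000000L * 3L)
is.na(res)                            # TRUE

# The silently wrapped values in the report can be reproduced by
# emulating signed 32-bit wraparound in double precision.
wrap32 <- function(x) (x + 2^31) %% 2^32 - 2^31
wrap32((1:6) * 1e9)
# 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704
```

Until the fix is in, multiplying the run values of the unweighted coverage by a numeric weight, e.g. runValue(coverage(rngs)) * 1e9, stays in double precision and so avoids the overflow when the weight is the same for all ranges.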
H.
>
> Cheers,
>
> Nico
>
> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] IRanges_1.15.17 BiocGenerics_0.3.0
>
> loaded via a namespace (and not attached):
> [1] stats4_2.15.1 tools_2.15.1
>
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319