[BioC] IRanges coverage integer limit?
Nicolas Delhomme
delhomme at embl.de
Wed Jul 11 11:29:05 CEST 2012
Hi Hervé,
On Jul 10, 2012, at 7:44 PM, Hervé Pagès wrote:
> Hi Nico,
>
> The overflow issue is addressed in IRanges 1.15.18 (devel).
Thanks!
>
> On 07/04/2012 02:16 AM, Nicolas Delhomme wrote:
>> Great, thanks!
>>
>> Hervé - how much effort is it to extend it to numeric? I'm willing to do it, I just do not want to start on something where YOU would say it's though ;-)
>
> I don't think it would be tough at all. The real question is: do we want
> coverage() to always return a numeric-Rle instead of integer-Rle? This
> will make the Rle 50% bigger in memory, probably not a big deal. On the
> other hand this would allow treating numeric weights really as numerics
> instead of truncating them like we do right now:
>
> > coverage(IRanges(1:3, width=10))
> 'integer' Rle of length 12 with 5 runs
> Lengths: 1 1 8 1 1
> Values : 1 2 3 2 1
> > coverage(IRanges(1:3, width=10), weight=2.86)
> 'integer' Rle of length 12 with 5 runs
> Lengths: 1 1 8 1 1
> Values : 2 4 6 4 2
>
> Maybe one option would be to return an integer-Rle when 'weight' is
> integer and a numeric-Rle when it's numeric. So by default (i.e. when
> no weights are supplied) it would still return an integer-Rle (because
> the default for 'weight' is 1L).
> But coverage(IRanges(1:3, width=10), weight=2) would return a
> numeric-Rle and coverage(IRanges(1:3, width=10), weight=2L)
> an integer-Rle.
>
> How does that sound?
That sounds really great! I find it actually really intuitive, i.e. that's how I would expect it to behave.
Let me know if there's something I can do to help with the changes,
Nico
>
> H.
>
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>>
>>
>>
>>
>> On Jul 3, 2012, at 8:00 PM, Hervé Pagès wrote:
>>
>>> On 07/03/2012 09:40 AM, Nicolas Delhomme wrote:
>>>> Hi,
>>>>
>>>> I've just discovered that the IRanges coverage function would "overflow" without warnings. Below is an example that reproduce it:
>>>>
>>>> library(IRanges)
>>>> rngs <- IRanges(c(1:100),width=100)
>>>> coverage(rngs)
>>>>
>>>> 'integer' Rle of length 199 with 199 runs
>>>> Lengths: 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
>>>> Values : 1 2 3 4 5 6 7 8 9 10 11 ... 10 9 8 7 6 5 4 3 2 1
>>>>
>>>> coverage(rngs,weight=1e9)
>>>>
>>>> 'integer' Rle of length 200 with 200 runs
>>>> Lengths: 1 1 1 ... 1 1
>>>> Values : 1000000000 2000000000 -1294967296 ... 1000000000 0
>>>>
>>>> runValue(coverage(rngs,weight=1e9))
>>>> [1] 1000000000 2000000000 -1294967296 -294967296 705032704 1705032704
>>>> [7] -1589934592 -589934592 410065408 1410065408 -1884901888 -884901888
>>>> ...
>>>>
>>>> Clearly, the third position that has a coverage of 3 (not weighted) has a 3e9 weighted one which is > 2^31 (signed integer limit on most machine). I'm just surprised that it is silently ignored.
>>>>
>>>> For NGS, getting a bp coverage > 2^31 is unlikely, although I've already seen extremely high coverage for Ribosomal-like protein that were only 10 order of magnitude away (~2M X). This limits the ranges of weights that can be used (weight as of now can only be integers), i.e. a weight of 100 would already be borderline.
>>>>
>>>> Is there a way around this, coverage being such a very handy function? I understand that weight being integers probably makes computation faster, but what could be the overhead of allowing numeric instead? And I don't mind looking under the hood if that helps.
>>>
>>> Thanks Nico for catching this other one. I will keep operations in the
>>> int space for now (so an 'integer' Rle is always returned) but will make
>>> sure a warning is issued and NAs are returned in case of overflow.
>>>
>>> H.
>>>
>>>>
>>>> Cheers,
>>>>
>>>> Nico
>>>>
>>>> sessionInfo()
>>>> R version 2.15.1 (2012-06-22)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>>
>>>> locale:
>>>> [1] C/UTF-8/C/C/C/C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] IRanges_1.15.17 BiocGenerics_0.3.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] stats4_2.15.1 tools_2.15.1
>>>>
>>>>
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>>
>>>> Genome Biology Computational Support
>>>>
>>>> European Molecular Biology Laboratory
>>>>
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>>
>>>
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
More information about the Bioconductor
mailing list