[BioC] Getting the length of every element from a large CompressedIRangesList is slow
Nicolas Delhomme
delhomme at embl.de
Tue Jul 3 09:36:00 CEST 2012
That's great! Thanks Hervé.
I remember seeing that in a thread on the mailing list, but couldn't recall it, and I couldn't find it in the documentation. Could it be made more obvious by adding it to the "See also" section of the IRangesList Rd page, as well as to the IRangesList-utils Rd page? That would be great too :-)
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On Jul 2, 2012, at 8:25 PM, Hervé Pagès wrote:
> Hi Nico,
>
> Even faster:
>
> > system.time(sizes <- elementLengths(exbytx))
>    user  system elapsed
>   0.000   0.000   0.001
>
> Note that you can use elementLengths on any list-like object
> ("list-like" = list or List class or subclass):
>
> > x <- rep(list(a=1:4, b=letters), 500000)
> > length(x)
> [1] 1000000
> > system.time(x_eltlens <- sapply(x, length))
>    user  system elapsed
>   3.132   0.008   3.142
> > system.time(x_eltlens2 <- elementLengths(x))
>    user  system elapsed
>   0.024   0.000   0.023
> > identical(x_eltlens, x_eltlens2)
> [1] TRUE
>
> HTH,
>
> H.
>
> On 07/02/2012 10:18 AM, Nicolas Delhomme wrote:
>> Hi,
>>
>> Just to extend on my previous message:
>>
>> Doing this instead is fast:
>>
>>> system.time(sizes <- sapply(width(aln.ranges),length))
>>    user  system elapsed
>>   1.109   0.144   1.254
>>
>> Cheers,
>>
>> Nico
>>
>>
>> On Jul 2, 2012, at 7:02 PM, Nicolas Delhomme wrote:
>>
>>> Hej!
>>>
>>> I've a rather large CompressedIRangesList
>>>
>>>> print(object.size(aln.ranges),unit="Mb")
>>> 390.4 Mb
>>>
>>> It has 2518 elements, some with up to 6M ranges, for a total of 51M ranges. The vast majority are small, however: the median is 2 and the 3rd quartile is 47, while the mean is ~20,000.
>>>
>>> Retrieving the element length is slow:
>>>
>>>> system.time(sizes <- sapply(aln.ranges,length))
>>>    user  system elapsed
>>> 265.777 169.222 443.498
>>>
>>> This surprised me, given the performance of the IRanges package in general. Is there a faster way to get this information than the sapply I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU/RAM/load.
>>>
>>>> sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] C/UTF-8/C/C/C/C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] IRanges_1.15.15 BiocGenerics_0.3.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_2.15.1
>>>
>>> Nico
>>>
>>> P.S. If you need, I can send my aln.ranges object off-list.
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>