[BioC] Manipulating large DNAStringSet(s)

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu May 6 04:30:52 CEST 2010


Hi Hervé,

Thanks for coming through once again!

-steve

2010/5/5 Hervé Pagès <hpages at fhcrc.org>:
> Hi Steve,
>
> Thanks for reporting this! I fixed this problem in IRanges 1.6.1
> (release) and 1.7.1 (devel). Both will be available thru biocLite()
> in the next 24 hours or so.
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100,
> length.out=2141404), width=15)
> strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
> is.pos <- strands == '+'
> v <- Views(Hsapiens$chr1, ranges)
>
> strings <- DNAStringSet(v)
>
> ## Takes about 2.5 sec on my laptop:
> strings[!is.pos] <- reverseComplement(strings[!is.pos])
>
> Cheers,
> H.
>
> Steve Lianoglou wrote:
>>
>> Hi all,
>>
>> I have a rather large DNAStringSet you could get from, say,
>> subselecting many intervals on Hsapiens$chr1 and I want to go through
>> the paces of reverseComplement-ing some of the ranges that correspond
>> to reads on the anti-sense strand.
>>
>> I'm finding that doing this the "naive" way is taking a lot of time
>> and memory, but adding some extra code to change my target to a
>> character vector (from a DNAStringSet) with some careful subsetting
>> and reverseComplement-ing is much faster.
>>
>> So I'm wondering if I'm either doing something wrong, or perhaps
>> something in DNAStringSet can be optimized to deal with this?
>>
>> Like so:
>>
>> R> library(BSgenome.Hsapiens.UCSC.hg18)
>> R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100,
>> length.out=2141404), width=15)
>> R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
>> R> is.pos <- strands == '+'
>> R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in
>> the tail here
>>
>> I now want to reverseComplement the strings in v where strands == '-'
>>
>> R> strings <- DNAStringSet(v)
>> R> strings[!is.pos] <- reverseComplement(strings[!is.pos])
>>
>> That above command takes a very long time to complete ... it's
>> actually been over a few minutes on my machine, so I'm killing it.
>>
>> Now let's do it another way, using normal character vectors and stuff
>>
>> R> strings <- DNAStringSet(v)
>> R> result <- character(length(strings))
>> R> result[is.pos] <- as.character(strings[is.pos])
>> R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos]))
>> ## and optionally ##
>> R> result <- DNAStringSet(result)
>>
>> Just curious if there's a better way?
>>
>> R> sessionInfo()
>> R version 2.11.0 (2010-04-22)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0
>>   Biostrings_2.16.0                  GenomicRanges_1.0.1
>> [5] IRanges_1.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioconductor mailing list