[BioC] Using summarizeOverlaps with multiple samples/readgroups in a single bam file?

Martin Morgan mtmorgan at fhcrc.org
Thu Jan 24 01:45:57 CET 2013


On 01/23/2013 04:00 PM, Ryan C. Thompson wrote:
> I've been thinking about this some more, and I don't think there's any inherent
> reason that one cannot parallelize access to multiple read groups in a single
> bam file, because I have previously successfully sped up bam file reading by
> parallelizing across chromosomes. I think it would be convenient to have all the
> data for all the samples in an experiment in a single file. If Rsamtools
> supported filtering by read groups using some kind of option to ScanBamParam
> (does it?), I think it would be sufficient to take a vectorized param argument
> to summarizeOverlaps. Then one could pass a list with one ScanBamParam for each
> read group and get parallel counting of multiple read groups from a single bam
> file.

If someone can point me to a reasonable publicly available BAM file with read 
groups I'd be happy to explore this a bit. Rsamtools doesn't (yet?) support 
filtering by read group.

Martin
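
Since there is no read-group filter at read time, one possible workaround is to
load the RG tag with each alignment and split the reads afterwards. The
following is only a minimal sketch under several assumptions: it uses the
current package layout (readGAlignments from GenomicAlignments, which did not
exist under that name when this message was written), the function name
count_by_read_group and its arguments are hypothetical, and it counts any
overlap per feature rather than reproducing the counting modes of
summarizeOverlaps. It also reads the whole file into memory, so it gives up the
memory-efficient iteration mentioned elsewhere in this thread.

```r
# Sketch only: per-read-group overlap counts from a single BAM file.
# 'bam_path' (path to a BAM file) and 'features' (e.g. a GRanges of exons)
# are placeholders supplied by the caller; nothing here filters at read time.
count_by_read_group <- function(bam_path, features) {
    # Request the optional RG tag alongside each alignment.
    param <- Rsamtools::ScanBamParam(tag = "RG")
    reads <- GenomicAlignments::readGAlignments(bam_path, param = param)

    # The tag arrives as a metadata column; reads without RG are NA.
    rg <- S4Vectors::mcols(reads)$RG
    rg[is.na(rg)] <- "unassigned"

    # Count overlaps separately for the reads of each read group and
    # bind the results into a features-by-read-group matrix.
    idx_by_rg <- split(seq_along(reads), rg)
    counts <- lapply(idx_by_rg, function(idx) {
        IRanges::countOverlaps(features, reads[idx])
    })
    do.call(cbind, counts)
}
```

Called with a merged BAM file and a set of feature ranges, this would return
one column of counts per read group, which is the shape a per-sample count
matrix would need.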

>
> What do you think?
>
> On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
>> On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
>>> Hi all,
>>>
>>> I'm looking at simplifying my differential expression pipeline a little bit
>>> by merging all my input bam files into one bam file with multiple
>>> samples/read groups and then using that bam file as input to
>>> summarizeOverlaps. Is this supported in any way? I've never worked with sam
>>> read groups before (I always just did one sample per file), so I don't
>>> really know anything about them.
>>>
>>> So is it supported to take a single bam file and use summarizeOverlaps or
>>> some other mechanism to get a SummarizedExperiment object with one column
>>> for each sample in the bam file, rather than one column per file?
>>
>> Rsamtools doesn't do anything special with read groups (e.g., no
>> pre-filtering) and summarizeOverlaps doesn't do per-read-group
>> counting (one can provide one's own counting function to
>> summarizeOverlaps, though). Also, parallelizing over bam files is a
>> simple way to get better throughput (providing a BamFileList as the
>> second argument to summarizeOverlaps, and with 'parallel' on the
>> search path, currently uses mclapply and memory-efficient iteration to
>> populate the SummarizedExperiment), so in some ways one large bam file
>> is a step in a counter-productive direction.
>>
>> Martin
>>
>>>
>>> -Ryan Thompson
>>
>>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


