[BioC] [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools?
Martin Morgan
mtmorgan at fhcrc.org
Wed Jan 23 01:38:09 CET 2013
On 1/22/2013 7:54 AM, Renjie Tan wrote:
> Hi Martin,
>
> Thank you for your answer. I still have some questions about that.
>
> 1, the seqinfo() only result to 86 sequences for a really bam file. So I know the 86 sequence is not the read sequence. What the 86 sequences are? And where are the reads?
seqinfo is returning information about the sequences to which reads have been
aligned, e.g., chr1, chr2, chr3, .... This is read from the BAM header file,
probably added by the aligner you used.
> 2, " cvg <- coverage(bf)" this step really time costly. It needs more than 10 mins for a really 5G bam file with a cluster. If I just want to calculate some parts of region's RD. should us read the whole bam file into RAM? If so how can I parse hundreds of bam samples?
In the release version, probably the easiest thing to do is to create a
'GRanges' describing the regions that you are interested in. You could create it
by hand if there were only a few regions, but maybe you'd create it by
manipulating some annotation resource of one sort or another. By hand, and from
your earlier email
>> Y:2595298-2595418
>>
>> Y:2601341-2601461
>>
>> Y:2606003-2606363
maybe you would do
starts <- c(2595298, 2601341, 2606003)
ends <- c(2595418, 2601461, 2606363)
roi <- GRanges("Y", IRanges(starts, ends), seqinfo=si)
to define your 'regions of interest', and then use this to input just the reads
that overlap (including partially) those regions
param <- ScanBamParam(which=roi)
gal <- readBamGappedAlignments(bf, param=param)
This should be quick, provided there are not too many regions. Then
cvg <- coverage(gal)
brings you to
> 3, the Result's output format is like this? What's the " [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]"parts mean? Which number is the RD for this region?
> start end width
> [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]
> [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...]
This is a 'View' onto the coverage vector, and it represents the number of reads
over the nucleotide at coordinate 100 (9), 101 (9), 102 (8), etc. The
coordinates 100-299, 1000-1199 are in this view because in my response I
identified those as regions of interest. Obviously you would use 'roi' created
above.
> 4, the output format I expect is look like GATK's form. Such as
> Target total_coverage
> 1:14467-14587 261
I guess (but you can tell me if I'm wrong) that 'total_coverage' is the sum of
the Views vectors, so from below...
> lapply(v, viewSums)
$seq1
[1] 4126 8412
$seq2
integer(0)
which says that, in the first region of interest, reads contributed a total of
4126 nucleotides.
If one were analyzing many BAM files, it would make sense to write a function
that takes as an argument a single bam file and perhaps other arguments, and
then does whatever it is that you're interested in
fun <- function(bamFile, param) {
bf <- BamFile(bamFile)
gal <- readBamGappedAlignments(bf, param=param)
cvg <- coverage(gal)
## etc
}
then to apply it to a list of bam files
fls <- dir("some/where", pattern=".*bam$")
res <- lapply(fls, fun, param=param)
and then to parallelize it, e.g., on a single machine
library(parallel)
res <- mclapply(fls, fun, param=param)
It is also possible to process the files in a cluster with only a little more
difficulty, but that's probably a topic for a different email thread.
You might find the resources at
http://bioconductor.org/help/course-materials/2012/
e.g., the PDF at least
http://bioconductor.org/help/course-materials/2012/CSC2012/
useful, and there is an upcoming course
https://secure.bioconductor.org/Seattle-Feb-2013/
that would be suitable for this type of question
Martin
> 1:14639-14883 1974
> 1:14943-15064 3412
> 1:15671-15990 374
> 1:16591-16719 0
>
> I appreciate your kindly help.
>
> Thanks,
> Renjie
>
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Monday, January 21, 2013 6:45 PM
> To: Renjie Tan
> Cc: Maintainer; bioconductor at r-project.org
> Subject: Re: [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools?
>
> Hi Renjie -- it's better to ask these questions on the Bioconductor mailing list, e.g., through the 'guest posting' facility if you don't want to subscribe
>
> http://bioconductor.org/help/mailing-list/mailform/
>
> That way everyone benefits from both questions and answers. For a reproducible example I ran
>
> library(Rsamtools)
> example(scanBam)
>
> This creates a variable 'fl' that points to a small bam file. I made an instance of the BamFile class, then discovered the information about chromosomes in the file
>
> bf <- BamFile(fl)
> si <- seqinfo(bf)
>
> which told me that I have two short sequences
>
> > si
> Seqinfo of length 2
> seqnames seqlengths isCircular genome
> seq1 1575 NA <NA>
> seq2 1584 NA <NA>
>
> Suppose some 'regions of interest' were on 'seq1', e.g., windows of width 200, starting at position 100 and 1000.
>
> roi <- GRanges("seq1", IRanges(c(100, 1000), width=200), seqinfo=si)
>
> I could calculate coverage in my BAM file, just over my regions of interest
>
> cvg <- coverage(bf)
>
> and then create 'Views' on my regions of interest
>
> v <- Map(Views, cvg, ranges(split(roi, seqnames(roi))))
>
> > v
> $seq1
> Views on a 1575-length Rle subject
>
> views:
> start end width
> [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]
> [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...]
>
> $seq2
> Views on a 1584-length Rle subject
>
> views: NONE
>
> You could then access elements of this as, e.g., v$seq1[1] so for a playful animation
>
> for (i in seq_along(v$seq1)) {
> plot(v$seq1[[i]], type="l")
> Sys.sleep(2)
> }
>
> If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an argument
> param=ScanBamParam(which=roi) which would just calculate coverage of (reads that
> overlap) your regions of interest; this would obviously save a lot of computation if you had just a few regions, or regions on a single chromosome.
>
> Martin
>
> On 01/21/2013 01:29 PM, Maintainer wrote:
>> Hi Martin,
>>
>> I am a new beginner Rsamtools user. I know that's really a powerful
>> tool. Can you give me a sample to calculate the Read Depth with Rsamtools?
>>
>> My input is a bam file(s) and a series regions like:
>>
>> Y:2595298-2595418
>>
>> Y:2601341-2601461
>>
>> Y:2606003-2606363
>>
>> ........................
>>
>> I need to calculate the *Read depth* for every regions above. Not the read count.
>>
>> Thanks,
>>
>> Renjie
>>
>>
>>
>> ______________________________________________________________________
>> __
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
More information about the Bioconductor
mailing list