[BioC] [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools?

Wed Jan 23 01:38:09 CET 2013

On 1/22/2013 7:54 AM, Renjie Tan wrote:
> Hi Martin,
>
> Thank you for your answer. I still have some questions about that.
>
> 1, the seqinfo() only result to 86 sequences for a really bam file. So I know the 86 sequence is not the read sequence. What the 86 sequences are? And where are the reads?

seqinfo is returning information about the sequences to which reads have been 
aligned, e.g., chr1, chr2, chr3, .... This is read from the BAM header file, 
probably added by the aligner you used.

> 2, " cvg <- coverage(bf)" this step really time costly. It needs more than 10 mins for a really 5G bam file with a cluster. If I just want to calculate some parts of region's RD. should us read the whole bam file into RAM? If so how can I parse hundreds of bam samples?

In the release version, probably the easiest thing to do is to create a 
'GRanges' describing the regions that you are interested in. You could create it 
by hand if there were only a few regions, but maybe you'd create it by 
manipulating some annotation resource of one sort or another. By hand, and from 
your earlier email

 >> Y:2595298-2595418
 >>
 >> Y:2601341-2601461
 >>
 >> Y:2606003-2606363

maybe you would do

   starts <- c(2595298, 2601341, 2606003)
   ends <- c(2595418, 2601461, 2606363)
   roi <- GRanges("Y", IRanges(starts, ends), seqinfo=si)

to define your 'regions of interest', and then use this to input just the reads 
that overlap (including partially) those regions

   param <- ScanBamParam(which=roi)
   gal <- readBamGappedAlignments(bf, param=param)

This should be quick, provided there are not too many regions. Then

   cvg <- coverage(gal)

brings you to

> 3, the Result's output format is like this? What's the " [ 9  9  8  9  9  8  8  8  7  7  8  8  9 10  7  6  6 ...]"parts mean? Which number is the RD for this region?
>       	start  end width
> 	[1]   100  299   200 [ 9  9  8  9  9  8  8  8  7  7  8  8  9 10  7  6  6 ...]
> 	[2]  1000 1199   200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...]

This is a 'View' onto the coverage vector, and it represents the number of reads 
over the nucleotide at coordinate 100 (9), 101 (9), 102 (8), etc. The 
coordinates 100-299, 1000-1199 are in this view because in my response I 
identified those as regions of interest. Obviously you would use 'roi' created 
above.

> 4, the output format I expect is look like GATK's form. Such as
> 	Target		total_coverage
> 	1:14467-14587		261

I guess (but you can tell me if I'm wrong) that 'total_coverage' is the sum of 
the Views vectors, so from below...

 > lapply(v, viewSums)
$seq1
[1] 4126 8412

$seq2
integer(0)

which says that, in the first region of interest, reads contributed a total of 
4126 nucleotides.

If one were analyzing many BAM files, it would make sense to write a function 
that takes as an argument a single bam file and perhaps other arguments, and 
then does whatever it is that you're interested in

   fun <- function(bamFile, param) {
         bf <- BamFile(bamFile)
         gal <- readBamGappedAlignments(bf, param=param)
         cvg <- coverage(gal)
         ## etc
   }

then to apply it to a list of bam files

    fls <- dir("some/where", pattern=".*bam$")
    res <- lapply(fls, fun, param=param)

and then to parallelize it, e.g., on a single machine

     library(parallel)
     res <- mclapply(fls, fun, param=param)

It is also possible to process the files in a cluster with only a little more 
difficulty, but that's probably a topic for a different email thread.

You might find the resources at

   http://bioconductor.org/help/course-materials/2012/

e.g., the PDF at least

   http://bioconductor.org/help/course-materials/2012/CSC2012/

useful, and there is an upcoming course

   https://secure.bioconductor.org/Seattle-Feb-2013/

that would be suitable for this type of question

Martin

> 	1:14639-14883		1974
> 	1:14943-15064		3412
> 	1:15671-15990		374
> 	1:16591-16719		0
>
> I appreciate your kindly help.
>
> Thanks,
> Renjie
>
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Monday, January 21, 2013 6:45 PM
> To: Renjie Tan
> Cc: Maintainer; bioconductor at r-project.org
> Subject: Re: [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools?
>
> Hi Renjie -- it's better to ask these questions on the Bioconductor mailing list, e.g., through the 'guest posting' facility if you don't want to subscribe
>
>     http://bioconductor.org/help/mailing-list/mailform/
>
> That way everyone benefits from both questions and answers. For a reproducible example I ran
>
>     library(Rsamtools)
>     example(scanBam)
>
> This creates a variable 'fl' that points to a small bam file. I made an instance of the BamFile class, then discovered the information about chromosomes in the file
>
>     bf <- BamFile(fl)
>     si <- seqinfo(bf)
>
> which told me that I have two short sequences
>
>   > si
> Seqinfo of length 2
> seqnames seqlengths isCircular genome
> seq1           1575         NA   <NA>
> seq2           1584         NA   <NA>
>
> Suppose some 'regions of interest' were on 'seq1', e.g., windows of width 200, starting at position 100 and 1000.
>
>     roi <- GRanges("seq1", IRanges(c(100, 1000), width=200), seqinfo=si)
>
> I could calculate coverage in my BAM file, just over my regions of interest
>
>     cvg <- coverage(bf)
>
> and then create 'Views' on my regions of interest
>
>     v <- Map(Views, cvg, ranges(split(roi, seqnames(roi))))
>
>   > v
> $seq1
> Views on a 1575-length Rle subject
>
> views:
>       start  end width
> [1]   100  299   200 [ 9  9  8  9  9  8  8  8  7  7  8  8  9 10  7  6  6 ...]
> [2]  1000 1199   200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...]
>
> $seq2
> Views on a 1584-length Rle subject
>
> views: NONE
>
> You could then access elements of this as, e.g., v$seq1[1] so for a playful animation
>
>     for (i in seq_along(v$seq1)) {
>        plot(v$seq1[[i]], type="l")
>        Sys.sleep(2)
>     }
>
> If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an argument
> param=ScanBamParam(which=roi) which would just calculate coverage of (reads that
> overlap) your regions of interest; this would obviously save a lot of computation if you had just a few regions, or regions on a single chromosome.
>
> Martin
>
> On 01/21/2013 01:29 PM, Maintainer wrote:
>> Hi Martin,
>>
>> I am a new beginner Rsamtools user. I know that's really a powerful
>> tool. Can you give me a sample to calculate the Read Depth with Rsamtools?
>>
>> My input is a bam file(s) and a series regions like:
>>
>> Y:2595298-2595418
>>
>> Y:2601341-2601461
>>
>> Y:2606003-2606363
>>
>> ........................
>>
>> I need to calculate the *Read depth* for every regions above. Not the read count.
>>
>> Thanks,
>>
>> Renjie
>>
>>
>>
>> ______________________________________________________________________
>> __
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109