[BioC] GenomicAlignments - Speeding up readGAlignmentPairs()

Hervé Pagès hpages at fhcrc.org
Tue Jul 15 00:06:07 CEST 2014


Hi Fong,

On 07/14/2014 01:02 PM, Fong Chun Chan wrote:
> As promised, I put some of the run-time details for a particular bam file
> comparing readGAlignments vs. readGAlignmentsPairs.
>
> readGAlignmentPairs has been running for like over a week I believe and
> doesn't look like it will actually stop running...so I don't have any
> run-time of this function unfortunately.

Could it be that the R process is eating up all your memory so now
it's running in "swapping mode"? This tends to slow things down *a lot*
(typically by a factor of 100 or 1000 or more). You can check that
by running the Unix commend 'top' in a separate terminal.

How much RAM do you have? You probably need at least 16Gb of RAM
(rough estimate) to read in a BAM file with 55 million pairs with
readGAlignmentPairs(). And possibly more than that if you're
loading the QNAME field (with 'use.names=TRUE') and/or other BAM
fields or tags.

Cheers,
H.

>
> ------
> group |    nb of |    nb of | mean / max
>                                     of |  records |   unique | records per
>                                records | in group |   QNAMEs | unique QNAME
> All records........................ A |110625579 | 55302565 |    2 / 4
>    o template has single segment.... S |        0 |        0 |   NA / NA
>    o template has multiple segments. M |110625579 | 55302565 |    2 / 4
>        - first segment.............. F | 55312870 | 55302565 |    1 / 2
>        - last segment............... L | 55312709 | 55302565 |    1 / 2
>        - other segment.............. O |        0 |        0 |   NA / NA
>
> Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of
> M.
> Indentation reflects this.
> [1] "Read using readGAlignments..."
> Unit: seconds
>
>       expr
>   reads <- readGAlignments(bamFile, param = ScanBamParam(which = gr,
>   flag = scf))
>        min       lq   median       uq      max neval
>   209.1448 216.8919 221.3251 225.7923 243.3946   100
>
> [1] "Read using readGAlignmentsPairs..."
> ------
>
>
> On Thu, Jul 3, 2014 at 12:11 PM, Valerie Obenchain <vobencha at fhcrc.org>
> wrote:
>
>> fyi the test files ('fls') came from,
>>
>> library(RNAseqData.HNRNPC.bam.chr14)
>> fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
>>
>> Valerie
>>
>>
>>
>> On 07/03/2014 10:42 AM, Valerie Obenchain wrote:
>>
>>> No, the division by 2 isn't always safe. 'ex1.bam' is a well-behaved
>>> file. Files with more diversity may have more than 2 segments per
>>> template.
>>>
>>> counts <- lapply(fls, function(fl)
>>>     countBam(fl, param=param)$records)
>>>
>>> pairs <- lapply(fls, function(fl)
>>>     length(readGAlignmentPairs(fl, param=param)))
>>>
>>>   unlist(counts)/2
>>>>>
>>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>>>> ERR127305
>>>>     363326    392420    397158    349970    355376    381493
>>>> 389176    355834
>>>>
>>>>> unlist(pairs)
>>>>>
>>>> ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
>>>> ERR127305
>>>>     363147    392252    396933    349822    355215    381274
>>>> 388991    355682
>>>>
>>>
>>> To see what's going on use readGAlignementsList() on the first file.
>>> This function adds some flexibility in that it reads in fragments, pairs
>>> w/ no mates, etc.
>>>
>>> galist <- readGAlignmentsList(fl, param=param)
>>>
>>> The length matches that of countBam() because it reads all records.
>>>
>>>   length(galist)
>>>>>
>>>> [1] 363226
>>>>
>>>
>>> 'mate_status' categorizes records by 'mated', 'ambiguous' and 'unmated'.
>>> Here we see the 363147 match to output from readGAlignmentsPairs() and
>>> the rest are ambiguous.
>>>
>>>   table(mcols(galist)$mate_status)
>>>>>
>>>>
>>>>      mated ambiguous   unmated
>>>>     363147        79         0
>>>>
>>>
>>> elementLengths() shows the number of segments per template. Again we see
>>> the 363147 match and it shows exactly 2 segments for those.
>>>
>>>   table(elementLengths(galist))
>>>>>
>>>>
>>>>       2      4      6      8
>>>> 363147     65      7      7
>>>>
>>>
>>>
>>> Valerie
>>>
>>>
>>> On 07/03/2014 10:08 AM, Fong Chun Chan wrote:
>>>
>>>> Thanks for the reply. I understand what you mean about counting aligned
>>>> pairs within a region and how that would give incorrect results.
>>>>
>>>> However, if I was purely interested in just getting all aligned
>>>> read-pairs within the whole genomic space (no restriction to any region
>>>> since this is what I need to normalize by), wouldn't the following code
>>>> work?
>>>>
>>>> library('Rsamtools')
>>>> library('GenomicAlignments')
>>>>
>>>> fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
>>>>
>>>> flags <- scanBamFlag(isPaired = TRUE,
>>>>                        isProperPair = TRUE,
>>>>                        isUnmappedQuery = FALSE,
>>>>                        hasUnmappedMate = FALSE)
>>>>
>>>> param <- ScanBamParam(flag = flags)
>>>>
>>>> countBam(fl, param = param)[['records']]/2
>>>> [1]1572
>>>>
>>>> galp <- readGAlignmentPairs(fl, param=param)
>>>> length(galp)
>>>> [1]1572
>>>>
>>>>
>>>> On Thu, Jul 3, 2014 at 9:40 AM, Valerie Obenchain <vobencha at fhcrc.org
>>>> <mailto:vobencha at fhcrc.org>> wrote:
>>>>
>>>>      Correction below.
>>>>
>>>>
>>>>      On 07/03/2014 09:28 AM, Valerie Obenchain wrote:
>>>>
>>>>          Hi,
>>>>
>>>>          On 07/02/2014 12:56 PM, Fong Chun Chan wrote:
>>>>
>>>>              Hi Valerie,
>>>>
>>>>              Thanks for the reply. I'll post some runtime as soon as it
>>>>              finishes
>>>>              running the readGAlignments and readGAlignmentPairs.
>>>>
>>>>              Thanks for the tip on passing range(allExons) into the which
>>>>              param.
>>>>              There is only one issue in that if I wanted to then generate
>>>>              say RPKM
>>>>              values where I need to know the "total number of aligned
>>>>              reads" in a bam
>>>>              file. Because I've read in just a subset of the bam (i.e.
>>>>              the reads that
>>>>              aligned just to exons) I would have technically produced an
>>>>              incorrection
>>>>              calculation of RPKM values (assuming you normalize using the
>>>>              number of
>>>>              aligned reads). On that note, is there a quick and easy way
>>>>              to just
>>>>              count the number of aligned reads or aligned paired reads so
>>>>              I can
>>>>              quickly normalize by this way. This way I can still use the
>>>>              which
>>>>              parameter in readGAlignments and readGAlignmentPairs to
>>>>              speed it up?
>>>>
>>>>
>>>>          For single-end you can use countBam():
>>>>
>>>>            > fl <- system.file("extdata", "ex1.bam", package =
>>>> "Rsamtools")
>>>>            > param <- ScanBamParam(flag = scanBamFlag(isUnmappedQuery =
>>>>          FALSE))
>>>>            > countBam(fl, param = param)
>>>>              space start end width    file records nucleotides
>>>>          1    NA    NA  NA    NA ex1.bam    3271      115286
>>>>
>>>>          It's not so clear for paired-end. The mate-pairing algorithm
>>>>          used in the
>>>>          readGAlignment* functions is described on the man page and uses a
>>>>          combination of BAM fields and flags. You can't really get at
>>>>          this simply
>>>>          by creating a param with a flag combination such as
>>>>
>>>>          isPaired = TRUE
>>>>          isProperPair = TRUE
>>>>          isUnmappedQuery = FALSE
>>>>          hasUnmappedMate = FALSE
>>>>
>>>>          For example, if you're interested in this region
>>>>
>>>>          gr <- GRanges("seq1", IRanges(100, 500))
>>>>
>>>>          and set flags,
>>>>
>>>>          flags <- scanBamFlag(isPaired = TRUE,
>>>>                                 isProperPair = TRUE,
>>>>                                 isUnmappedQuery = FALSE,
>>>>                                 hasUnmappedMate = FALSE)
>>>>
>>>>          countBam() reports 335 records. These are individual records so
>>>>          the mate
>>>>          pairs (parts) are each counted separately. Only records that
>>>>          fall within
>>>>          the region defined by 'gr' are counted.
>>>>
>>>>            > param <- ScanBamParam(which = gr, flag = flags)
>>>>            > countBam(fl, param = param)
>>>>              space start end width    file records nucleotides
>>>>          1  seq1   100 500   401 ex1.bam     335       11785
>>>>
>>>>
>>>>          readGAlignmentPairs() reports 223 pairs (446 individual
>>>>          records). All
>>>>          records in 'gr' are mated if possible. In the case where one
>>>>          mate part
>>>>          falls inside 'gr' and one outside, the other mate will be
>>>> retrieved
>>>>          (i.e., outside the range of 'gr'). This makes the 335 above
>>>>          confusing
>>>>          for counting the number of aligned pairs in the region. The
>>>>          number is a
>>>>          mixture of (1) mate parts both in 'gr' that belong together
>>>> (2) mate
>>>>          parts with mate outside 'gr' and (3) mates or mate parts that
>>>>          don't meet
>>>>          other mate-pairing criteria.
>>>>
>>>>            > galp <- readGAlignmentPairs(fl, param=param)
>>>>            > length(galp)
>>>>          [1] 223
>>>>
>>>>          Off hand I can't think of a good way to count aligned pairs
>>>> w/in a
>>>>          specified region.
>>>>
>>>>
>>>>      What I should have said here is I can't think of a good way to
>>>>      accurately count aligned pairs (1 count per pair) in a BAM file w/o
>>>>      using the pairing algo.
>>>>
>>>>      Valerie
>>>>
>>>>
>>>>
>>>>              Also another thing I notice is that when I use the
>>>>              readGAlignmentPairs function, I lose my ability to cancel
>>>>              the command
>>>>              (ctrl-c no longer works).
>>>>
>>>>
>>>>          It's implemented in C/C++. You can try ctrl+\ in the session
>>>> or kill
>>>>          with the top command from another terminal. Maybe others have
>>>>          suggestions they'll add ...
>>>>
>>>>          Valerie
>>>>
>>>>          Not sure this is a bug, or if it is something
>>>>
>>>>              to do with the internal workings of the function. The loss
>>>>              of ctrl-c
>>>>              also occurs when trying to run as a Rscript. I can seemingly
>>>>              ctrl-c
>>>>              anytime before, but it hits this step it just refuses to
>>>> cancel.
>>>>
>>>>              Thanks,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>              On Tue, Jul 1, 2014 at 10:06 AM, Valerie Obenchain
>>>>              <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>>>              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>
>>>> wrote:
>>>>
>>>>                   Hi,
>>>>
>>>>                   You don't need to call findMateAlignment() when using
>>>>                   readGAlignmentPairs(). In a previous iteration,
>>>>                   readGAlignmentPairs() called findMateAlignment() but
>>>>              this is no
>>>>                   longer the case.
>>>>
>>>>                   readGAlignmentPairs() does have more overhead than
>>>>              readGAlignments()
>>>>                   because of the mate paring. These numbers are for a BAM
>>>>              file with
>>>>                   800484 records (400054 final pairs) on my not-so-speedy
>>>>              laptop.
>>>>
>>>>                   library(microbenchmark)
>>>>                   library(RNAseqData.HNRNPC.bam.____chr14)
>>>>                   fls <- RNAseqData.HNRNPC.bam.chr14_____BAMFILES
>>>>
>>>>                             microbenchmark(____readGAlignments(fls[1]),
>>>>              times=5)
>>>>
>>>>                       Unit: seconds
>>>>                                            expr      min       lq  median
>>>>                     uq
>>>>                         max neval
>>>>                         readGAlignments(fls[1]) 1.753653 1.788292 1.85527
>>>>              2.019071
>>>>                       2.121211     5
>>>>
>>>>
>>>>
>>>>                   ## mate pairing
>>>>
>>>>
>>>>              microbenchmark(____readGAlignmentPairs(fls[1]), times=5)
>>>>
>>>>                       Unit: seconds
>>>>                                                expr      min       lq
>>>>              median
>>>>                       uq      max neval
>>>>                         readGAlignmentPairs(fls[1]) 9.283966 9.300328
>>>>              9.535194
>>>>                       9.843282 11.37441     5
>>>>
>>>>
>>>>                   ## mate pairing
>>>>
>>>>
>>>>              microbenchmark(____readGAlignmentsList(fls[1]), times=5)
>>>>
>>>>                       Unit: seconds
>>>>                                                expr      min       lq
>>>>              median
>>>>                       uq      max neval
>>>>                         readGAlignmentsList(fls[1]) 8.409743 8.457232
>>>>              8.803696
>>>>                       9.692845 9.867628     5
>>>>
>>>>
>>>>
>>>>                   You can set a 'yieldSize' when creating the BamFile.
>>>>              This will chunk
>>>>                   through the file 'yieldSize' records at a time and is
>>>>              useful when
>>>>                   processing a complete file and when there are memory
>>>>              limitations.
>>>>
>>>>                      bf <- BamFile(myfile, yieldSize = 1000000)
>>>>
>>>>                   Herve reported some additional timings for
>>>>              readGAlignmentPairs() on
>>>>                   this post:
>>>>
>>>>
>>>> https://stat.ethz.ch/____pipermail/bioconductor/2014-____May/059695.html
>>>>
>>>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html>
>>>>
>>>>
>>>> <https://stat.ethz.ch/__pipermail/bioconductor/2014-__May/059695.html
>>>>
>>>> <https://stat.ethz.ch/pipermail/bioconductor/2014-May/059695.html>>
>>>>
>>>>
>>>>                   Can you provide some numbers for how long your run is
>>>>              taking and the
>>>>                   number of records you're trying to read in?
>>>>
>>>>                   I assume 'allExons' is used for counting against the
>>>>              BAM files in
>>>>                   later code? Maybe this is why your were looking into
>>>>                   summarizeOverlaps(). I should point out that if
>>>> you're only
>>>>                   interested in the regions overlapping 'allExons' you
>>>>              could call
>>>>                   range(allExons) and use that as the 'which' in the
>>>>              param. This would
>>>>                   focus the area of interest and speed up the reading.
>>>>
>>>>                   Valerie
>>>>
>>>>
>>>>
>>>>
>>>>                   On 06/30/14 12:22, Fong Chun Chan wrote:
>>>>
>>>>                       Hi,
>>>>
>>>>                       I've been using the GenomicFeatures R package to
>>>>              read in some
>>>>                       RNA-seq
>>>>                       paired-end read data. While the readGAlignments()
>>>>              function reads
>>>>                       in the bam
>>>>                       file within a minute, I've noticed that the
>>>>                       readGAlignmentPairs() function
>>>>                       is extremely slow.
>>>>
>>>>                       This is even after restricting the space to just a
>>>>              single
>>>>                       chromosome along
>>>>                       with setting the flags isDuplicate = FALSE and
>>>>              isNotPrimaryRead
>>>>                       = FALSE
>>>>                       (the latter being suggested in the reference):
>>>>
>>>>                       "An easy way to avoid this degradation is to load
>>>>              only primary
>>>>                       alignments
>>>>                       by setting the isNotPrimaryRead ï¬'ag to FALSE in
>>>>              ScanBamParam()"
>>>>
>>>>                       Based on my understanding, it seems that the ï¬
>>>>                       ndMateAlignment() function
>>>>
>>>>                       has to be run first. Is there any other way to
>>>>              speed it up?
>>>>                       Perhaps I am
>>>>                       doing something wrong here?
>>>>
>>>>                       Code and sessionInfo() below.
>>>>
>>>>                       Thanks!
>>>>
>>>>                       ----
>>>>                       library("GenomicFeatures")
>>>>                       library('GenomicAlignments')
>>>>                       library('Rsamtools')
>>>>                       si <- seqinfo(BamFile(arguments$____bamFile))
>>>>                       gr <- GRanges(arguments$chr, IRanges(100,
>>>>                       seqlengths(si)[[arguments$chr]____]-100))
>>>>                       allExons <- exons(txdb, columns = c('gene_id',
>>>>              'exon_id',
>>>>                       'exon_name'))
>>>>                       scf <- scanBamFlag( isDuplicate = FALSE,
>>>>              isNotPrimaryRead =
>>>>                       FALSE ) #
>>>>                       remove duplicate reads, remove non-primary reads
>>>> (i.e.
>>>>                       multi-aligned reads)
>>>>                       reads <- readGAlignmentPairs( arguments$bamFile,
>>>>              param =
>>>>                       ScanBamParam(
>>>>                       which = gr, flag = scf ) ); # grab reads in
>>>>              specific region
>>>>
>>>>                           sessionInfo()
>>>>
>>>>                       R version 3.1.0 (2014-04-10)
>>>>                       Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>>                       locale:
>>>>                          [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>                       LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>                       LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>                         LC_PAPER=en_US.UTF-8
>>>>                               LC_NAME=C                  LC_ADDRESS=C
>>>>                       LC_TELEPHONE=C
>>>>                                     LC_MEASUREMENT=en_US.UTF-8
>>>>              LC_IDENTIFICATION=C
>>>>
>>>>                       attached base packages:
>>>>                       [1] parallel  stats     graphics  grDevices utils
>>>>                 datasets
>>>>                         methods
>>>>                       base
>>>>
>>>>                       other attached packages:
>>>>                          [1] argparse_1.0.1          proto_0.3-10
>>>>                          GenomicAlignments_1.0.1 BSgenome_1.32.0
>>>>              Rsamtools_1.16.1
>>>>                          Biostrings_2.32.0       XVector_0.4.0
>>>>                       GenomicFeatures_1.16.2
>>>>                          AnnotationDbi_1.26.0    Biobase_2.24.0
>>>>                         GenomicRanges_1.16.3
>>>>                          GenomeInfoDb_1.0.2      IRanges_1.22.9
>>>>                         BiocGenerics_0.10.0
>>>>                       vimcom.plus_0.9-93
>>>>                       [16] setwidth_1.0-3          colorout_1.0-2
>>>>
>>>>                       loaded via a namespace (and not attached):
>>>>                          [1] BatchJobs_1.2      BBmisc_1.7
>>>>              BiocParallel_0.6.1
>>>>                       biomaRt_2.20.0     bitops_1.0-6       brew_1.0-6
>>>>                       checkmate_1.0
>>>>                          codetools_0.2-8    DBI_0.2-7
>>>> digest_0.6.4
>>>>              fail_1.2
>>>>                           findpython_1.0.1   foreach_1.4.2
>>>> getopt_1.20.0
>>>>                         iterators_1.0.7
>>>>                            plyr_1.8.1         Rcpp_0.11.2
>>>>                RCurl_1.95-4.1
>>>>                       [19] rjson_0.2.14       RSQLite_0.11.4
>>>>              rtracklayer_1.24.2
>>>>                       sendmailR_1.1-2    stats4_3.1.0       stringr_0.6.2
>>>>              tools_3.1.0
>>>>                          XML_3.98-1.1       zlibbioc_1.10.0
>>>>
>>>>
>>>>
>>>>                                [[alternative HTML version deleted]]
>>>>
>>>>
>>>>
>>>>                       ___________________________________________________
>>>>                       Bioconductor mailing list
>>>>              Bioconductor at r-project.org
>>>>              <mailto:Bioconductor at r-project.org>
>>>>              <mailto:Bioconductor at r-__project.org
>>>>              <mailto:Bioconductor at r-project.org>>
>>>>              https://stat.ethz.ch/mailman/____listinfo/bioconductor
>>>>              <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>>>>
>>>>              <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>>>              <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>>>>                       Search the archives:
>>>>
>>>>
>>>> http://news.gmane.org/gmane.____science.biology.informatics.
>>>> ____conductor
>>>>
>>>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>>>>
>>>>
>>>> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>>>
>>>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>>>>
>>>>
>>>>
>>>>                   --
>>>>                   Valerie Obenchain
>>>>                   Program in Computational Biology
>>>>                   Fred Hutchinson Cancer Research Center
>>>>                   1100 Fairview Ave. N, Seattle, WA 98109
>>>>
>>>>                   Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>>>              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
>>>>                   Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>>>>              <tel:%28206%29%20667-3158>
>>>>
>>>>
>>>>
>>>>          _________________________________________________
>>>>          Bioconductor mailing list
>>>>          Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>>>          https://stat.ethz.ch/mailman/__listinfo/bioconductor
>>>>          <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>          Search the archives:
>>>>
>>>> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>>>>
>>>> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list