[BioC] countOverlaps Warnings

Mon Apr 21 22:32:58 CEST 2014

Hi Pankaj,

There are several options for counting paired-end reads.

Both readGAlignmentPairs() and readGAlignmentsList() are appropriate for 
paired-end data and are in the GenomicAlignments package. They use the 
same counting algorithm but return the data in different containers. 
readGAlignmentPairs() returns only pairs while readGAlignmentsList() 
returns pairs as well as singletons, reads with unmapped pairs etc. See 
the ?readGAlignments man page for details.

Another counting function is summarizeOverlaps() which uses the above 
functions 'under the hood'. It returns the counts in the 'assays' slot 
of a SummarizedExperiment object. Since you have multiple bam files to 
count you may want to use the BamFileList method.

The man page has examples of counting paired-end bam files.
library(Rsamtools)
?summarizeOverlaps

Other packages that offer count functions are Rsubread and easyRNASeq.

Valerie

On 04/21/2014 09:53 AM, Pankaj Agarwal wrote:
> Hi,
>
> I am analyzing a matched pair tumor/normal rna-seq data set.  After aligning with bowtie2 against UCSC hg19 index, I am trying to get the counts for each of the samples using the iGenomes UCSC GTF file.  I came across the following tutorial which shows different ways to do it in Bioconductor:
> http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf
>
> Following along slide 28 "Read Counting with countOverlaps", I am able to generate the counts but get the following Warnings.  I just want to be sure that there is nothing to be worried about because I don't understand the meaning of these warnings.  My code is as follows:
>
>> library(GenomicFeatures)
> Loading required package: AnnotationDbi
>> library(Rsamtools)
>> library(rtracklayer)
>> txdb <- loadDb("./data/ucsc_igenomes_hg19_gtf.sqlite")
>> eByg <- exonsBy(txdb, by="gene")
>> head(eByg)
> GRangesList of length 6:
> $A1BG
> GRanges with 8 ranges and 2 metadata columns:
>        seqnames               ranges strand |   exon_id   exon_name
>           <Rle>            <IRanges>  <Rle> | <integer> <character>
>    [1]    chr19 [58858172, 58858395]      - |    163177        <NA>
>    [2]    chr19 [58858719, 58859006]      - |    163178        <NA>
>    [3]    chr19 [58861736, 58862017]      - |    163179        <NA>
>    [4]    chr19 [58862757, 58863053]      - |    163180        <NA>
>    [5]    chr19 [58863649, 58863921]      - |    163181        <NA>
>    [6]    chr19 [58864294, 58864563]      - |    163182        <NA>
>    [7]    chr19 [58864658, 58864693]      - |    163183        <NA>
>    [8]    chr19 [58864770, 58864865]      - |    163184        <NA>
>
> $A1BG-AS1
> GRanges with 4 ranges and 2 metadata columns:
>        seqnames               ranges strand | exon_id exon_name
>    [1]    chr19 [58863336, 58864410]      + |  156640      <NA>
>    [2]    chr19 [58864745, 58864840]      + |  156641      <NA>
>    [3]    chr19 [58865080, 58865223]      + |  156642      <NA>
>    [4]    chr19 [58865735, 58866549]      + |  156643      <NA>
>
> ...
> <4 more elements>
> ---
> seqlengths:
>                   chr12                  chr8 ...  chr1_gl000192_random
>                      NA                    NA ...                    NA
>> samples = c("BRPC13-1118_L1.D710_501.sorted", "BRPC_13-764_L2.sorted", "DU04_PBMC_RNA_L1.sorted", "DU06_PBMC_RNA_L1.sorted")
>>
>> samples
> [1] "BRPC13-1118_L1.D710_501.sorted" "BRPC_13-764_L2.sorted"
> [3] "DU04_PBMC_RNA_L1.sorted"        "DU06_PBMC_RNA_L1.sorted"
>>
>> samplespath <- paste("./", samples, ".bam", sep="")
>>
>> samplespath
> [1] "./BRPC13-1118_L1.D710_501.sorted.bam"
> [2] "./BRPC_13-764_L2.sorted.bam"
> [3] "./DU04_PBMC_RNA_L1.sorted.bam"
> [4] "./DU06_PBMC_RNA_L1.sorted.bam"
>>
>> names(samplespath) <- samples
>>
>> samplespath
>          BRPC13-1118_L1.D710_501.sorted                  BRPC_13-764_L2.sorted
> "./BRPC13-1118_L1.D710_501.sorted.bam"          "./BRPC_13-764_L2.sorted.bam"
>                 DU04_PBMC_RNA_L1.sorted                DU06_PBMC_RNA_L1.sorted
>         "./DU04_PBMC_RNA_L1.sorted.bam"        "./DU06_PBMC_RNA_L1.sorted.bam"
>>
>> countDF <- data.frame(row.names=names(eByg))
>>
>> countDF
> data frame with 0 columns and 23710 rows
>>
>> dim(countDF)
> [1] 23710     0
>>
>> for(i in samplespath) {
> + aligns <- readGAlignmentsFromBam(i)
> + counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE)
> + countDF <- cbind(countDF, counts)
> + }
> Warning messages:
> 1: In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
>    - in 'y': chrM
>    Make sure to always combine/compare objects based on the same reference
>    genome (use suppressWarnings() to suppress this warning).
> 2: In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
>    - in 'y': chrM
>    Make sure to always combine/compare objects based on the same reference
>    genome (use suppressWarnings() to suppress this warning).
> 3: In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
>    - in 'y': chrM
>    Make sure to always combine/compare objects based on the same reference
>    genome (use suppressWarnings() to suppress this warning).
> 4: In .Seqinfo.mergexy(x, y) :
>    Each of the 2 combined objects has sequence levels not in the other:
>    - in 'x': chr6_cox_hap2, chr6_dbb_hap3, chr6_ssto_hap7, chr6_qbl_hap6, chr6_mann_hap4, chr6_mcf_hap5, chr6_apd_hap1, chrUn_gl000228, chr17_ctg5_hap1, chr19_gl000209_random, chr4_ctg9_hap1, chrUn_gl000222, chrUn_gl000223, chr1_gl000191_random, chr7_gl000195_random, chrUn_gl000213, chrUn_gl000220, chr17_gl000205_random, chrUn_gl000212, chrUn_gl000218, chrUn_gl000219, chrUn_gl000211, chr4_gl000193_random, chr4_gl000194_random, chr1_gl000192_random
>    - in 'y': chrM
>    Make sure to always combine/compare objects based on the same reference
>    genome (use suppressWarnings() to suppress this warning).
>>
>
> I also wanted to verify if what I am doing is correct for paired end reads.
>
> Thanks,
>
> - Pankaj
>
>
> =======================================
> sessionInfo()
> R version 3.0.3 (2014-03-06)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158