[BioC] Finding my un-counted reads

Thu Aug 29 00:31:38 CEST 2013

Hi,

On Wed, Aug 28, 2013 at 3:15 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hi Sam,
>
> ## counting reads
>
> If you haven't done this already I'd confirm counts on a single file. A bam
> can be read in with readGAlignments(). If you have paired-end data you can
> use ?readGAlignmentPairs or ?readGAlignmentsList.
>
>     bf <- BamFile("pathToOneBam")
>     reads <- readGAlignments(bf) ## devel
>     reads <- readGappedAlignments(bf) ## release
>
> Take a look at the makeTranscriptDbFromGFF() function in GenomicFeatures. It
> creates a TranscriptDb object from gff or gtf. The advantage of working with
> a TxDb is that you get the accessors such as cdsBy(..., "gene"), etc. for
> free (you may know this, just an fyi).

UTRs are exons, too!

> Get an idea of how many reads hit each list element of the GRangesList. This
> method does double count so it's possible for the sum of 'olap' to be
> greater than the length of 'reads'.
>
>     olap <- countOverlaps(cds_gene, reads)
>
> If the numbers in 'olap' are very small (and this is unexpected) you may
> have a problem with how you've created the 'cds_gene' annotation. If the
> numbers make sense then try summarizeOverlaps.
>
>     se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty")
>
> If the number of hits in assays(se)$counts are greatly reduced you may have
> many reads overlapping each other or a single feature in 'cds_gene'. This
> can be checked by looking at the numbers in 'olap'. If you are using the
> devel version (GenomicRanges >= 1.13.9) the 'inter.feature' argument allows
> reads to be counted for each feature they overlap.
>
>
> ## mapped vs unmapped reads

I don't think the issue was mapped vs. unmapped, but rather "genic" vs
"intergenic".

If the OP took the TxDb route, a first approximation to the number of
intergenic reads could be counted by doing something like:

    tx <- transcripts(txdb)
    strand(tx) <- '*';
    txs <- reduce(tx)

    o <- countOverlaps(txs, reads, ignore.strand=TRUE)

... or something similar. This treats intronic reads as valid as reads
that map to exons, but again -- a first approximation to answer the
"DNA contamination". If you want to include intronic reads as "bad"
reads, then do `exons(txdb)` instead of `transcripts(txdb)` and
proceed as above.

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech