[BioC] multiple hits with countOverlaps function
    Steve Lianoglou 
    mailinglist.honeypot at gmail.com
       
    Thu Apr 14 03:56:50 CEST 2011
    
    
  
Hi,
On Wed, Apr 13, 2011 at 9:20 PM, Wei Shi <shi at wehi.edu.au> wrote:
[snip]
>> I dont get the "the second read was counted twice. "  It is the nature
>> of the problem that reads have length > 1 and they can overlap
>> multiple features and you need to thing about how you want to deal
>> with this.  I assume you are looking at HTseq data, and I cannot
>> really understand what you are trying to do.
>>
>> Kasper
>
> Let's take RNA-seq data for an example. It is known that many genes overlap with each other in the genome. If a read is mapped to a location which is shared by two genes (by two exons from the two genes respectively to be exact), then this read should not be assigned to both genes because it can only originate from one of the genes/transcripts.
Yes, it can only originate from one, but how would countOverlaps know which one?
How does anyone know which one?
As far as I know, there is no standard way to deal with this, so the
burden is on you.
If you have an idea of "what's right" in this situation, why don't you
rather ask the opposite question first, which is: "Which reads overlap
more than one feature?"
R> features.per.read <- countOverlaps(reads, features)
Then do your "naive"  counting/quantifying w/ just the "easy" case:
R> reads.per.feature <- countOverlaps(features,
features.per.read[features.per.read == 1])
Now you can figure out what you want to do for features.per.read > 1,
because picking one feature to assign the read to at random by default
is clearly wrong, right?
What if the gene/isoform that the read overlaps with isn't expressed
at all, you'd be assigning reads to it "just because" in your
scenario.
-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
    
    
More information about the Bioconductor
mailing list