[BioC] duplicate reads in mRNA-Seq

Mon Feb 14 18:16:36 CET 2011

Hi Dr. Anders and Dr. Jason, 

May I ask, what is the frequency of duplicates that you have had in your data?

In my final aligned and filtered dataset (filtered on unique match and number of mismatches), I have had ~0.6 duplicates. So far, I have run the analysis without them.

Thanks,
Fernando

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Simon Anders
Sent: Saturday, February 12, 2011 11:39 AM
To: Jason Lu
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] duplicate reads in mRNA-Seq

Hi Jason

> It seems that duplicate reads are very common in mRNA-seq data.
> Duplicate reads are those mapped to exactly the same chromosome 
> location and on the same strand (maybe from PCR amplification). I 
> would like to know what the general practice is for dealing with them. I 
> suspect some of them may contribute to the large overdispersion in 
> the final count data.

I know it is sometimes recommended to remove them, but I'd advise against this.

One of the advantages of RNA-Seq over expression microarrays is the large gain in dynamic range. On arrays, lowly expressed genes drown in background fluorescence and highly expressed genes saturate the hybridisation, giving you a dynamic range of typically little more than 25 dB (i.e., ratios of up to at most 1:300).

In RNA-Seq, very weak genes give rise to fewer than 10 counts, while the strongest genes may give well above 100,000 counts, i.e., the usable dynamic range now easily exceeds 45 dB or 50 dB (1:100,000).
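To make the dB figures above concrete, here is a minimal sketch of the conversion, assuming the usual 10*log10 convention for ratios. The function name ratio_to_db is only illustrative.

```python
import math


def ratio_to_db(ratio: float) -> float:
    """Convert a ratio to decibels, assuming the 10*log10 convention."""
    return 10.0 * math.log10(ratio)


# Ratios quoted in the text: about 1:300 for arrays, 1:100,000 for RNA-Seq.
for label, ratio in [("array, 1:300", 300), ("RNA-Seq, 1:100,000", 100_000)]:
    print(f"{label}: {ratio_to_db(ratio):.1f} dB")
```

Under that convention, 1:300 comes out at about 24.8 dB and 1:100,000 at exactly 50 dB, which matches the numbers quoted.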

Now, imagine you counted several reads mapping to the same position at most once. Then, a transcript of, say, 1 kb can accumulate at most 1,000 counts, even if it were one of those strongly expressed ones with a 5-figure raw count. Hence, you would dramatically squash your dynamic range and lose all hope of linearity (i.e., you can no longer expect the count rate to be at least roughly proportional to the concentration).
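The saturation can be sketched numerically. The following is a rough illustration only, under the simplifying assumption (not made in the text) that read start positions are uniformly distributed over the transcript on a single strand; the function name expected_distinct_starts is hypothetical.

```python
def expected_distinct_starts(n_positions: int, n_reads: int) -> float:
    """Expected number of distinct start positions hit by n_reads reads.

    Assumption: each read starts at one of n_positions positions,
    uniformly and independently. Counting each position at most once
    leaves exactly this many counts after duplicate removal.
    """
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)


# A 1 kb transcript at increasing expression levels.
for n_reads in (10, 1_000, 100_000):
    kept = expected_distinct_starts(1_000, n_reads)
    print(f"{n_reads:>7} raw reads -> about {kept:.0f} counts after collapsing")
```

At low read numbers almost every read survives, so the counts stay roughly proportional to the input. At high read numbers the collapsed count flattens out at the transcript length, which is the loss of linearity described above.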

Of course, if there are PCR artifacts, they destroy the linearity as well. So, if you have an exon to which only very few reads map except for one specific position that shows a pile of hundreds of reads, all with precisely the same coordinates, then there is reason for concern. I have seen such "towers" only rarely in RNA-Seq. (Actually, I haven't seen them at all recently, but I think they were a common concern two years ago. I wonder where they went. Did they maybe improve the PCR steps of the library preparation protocols?)
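One way to look for such towers is sketched below. This is a hypothetical heuristic, not an established method: the function name find_towers and both thresholds (min_pile, min_fraction) are arbitrary assumptions chosen for illustration, and the input is assumed to be the list of read start coordinates for a single exon and strand.

```python
from collections import Counter


def find_towers(read_starts, min_pile=100, min_fraction=0.5):
    """Return (position, pile_size) pairs that look like towers.

    A position is flagged if at least min_pile reads start there AND
    those reads make up at least min_fraction of all reads on the exon.
    Both thresholds are illustrative assumptions.
    """
    counts = Counter(read_starts)
    total = sum(counts.values())
    return [
        (pos, n)
        for pos, n in sorted(counts.items())
        if n >= min_pile and n / total >= min_fraction
    ]


# Toy exon: four scattered reads plus 400 reads starting at position 200.
starts = [10, 55, 120, 300] + [200] * 400
print(find_towers(starts))
```

The fraction test is what separates a tower from a merely highly expressed exon: in the latter, many positions carry large piles, so no single position dominates.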

  Simon

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor