[BioC] HT-seq counting - gene vs isoform

Wed Dec 5 19:47:25 CET 2012

Thank you very much for your response Simon.

Best Regards,
Nirmala

------------------------------------------------------------------------------------------------------------------------------
Contractor
Buiding 35, Room 1A-205
35 Convent Drive,
National Institute of Mental Health/NIH
Bethesda
MD - 20892
Phone# 301-451-4258

-----Original Message-----
From: Simon Anders [mailto:anders at embl.de] 
Sent: Wednesday, December 05, 2012 5:36 AM
To: bioconductor at r-project.org
Subject: Re: [BioC] HT-seq counting - gene vs isoform

Hi

On 04/12/12 23:46, Akula, Nirmala (NIH/NIMH) [C] wrote:
> When counting at gene level, I assume the reads that fall on all exons (Exon1, Exon2, Exon3 and Exon4) are all summed up for GeneA.
>
> When counting at isoform level,
>
> GeneA_isoform1 - is it sum of exons from Exon1, Exon2 and Exon3 (or) just reads that map to Exon2?
> GeneA_isoform2 - is it sum of Exon1 and Exon3 (or) no counts because its exons are common with isoform1 and isoform3?
> GeneA_isoform3 - sum of Exon1 and Exon4 (or) only Exon4?

Always the latter. This is why htseq-count is not suitable to count at isoform level.

To explain the rationale behind this:

HTSeq-count is meant to be used for differential expression analysis; hence the rule that ambiguous mappings are discarded. Consider two genes that share part of their sequence, one of them being differentially expressed, the other not. If we count reads mapping to the shared part (and hence to both genes), we will wrongly conclude that they are _both_ differentially expressed. If we discard the reads mapping to the shared part, we underestimate both genes' expression but we do so by the same fraction in all samples so that any inference about expression changes is still correct.

For counting at gene level, we can afford to discard the rather few reads that map to shared sequence. (With long reads, there is few such stretches longer than the read length even between paralogs.) For isoforms, this becomes untenable, and hence, any attempt of inferring differential expression at the isoform level is bound to fail if it is based on simple counting.

Instead, one should either use some method based on Bayesian inference (e.g. BitSeq) or perform the inference on the exon level (our DEXSeq approach). See our paper for a discussion why the prefer the latter and see Glaus et al.'s paper to learn more about the former.

   Simon

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor