[BioC] package to count read prior to DESeq?

Fri Feb 3 20:48:26 CET 2012

Dear Abhi,

Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make. 

First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data.

Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data. 

Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve.

If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR.

I hope this helps,

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 3 Feb 2012, at 19:51, Abhishek Pratap wrote:

> Hi Nico
> 
> With respect to easyRNAeq package I am wondering if I supply it a
> transcript level gtf file can it produce gene level counts. Basically
> the file is the output from cufflinks.
> 
> the last(attributes) column in the file does have the gene_id attribute.
> 
> 
> Best,
> -Abhi
> 
> 
> 
> On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>> Hi Nat,
>> 
>> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help.
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> On 3 Feb 2012, at 16:39, nathalie wrote:
>> 
>>> Hi,
>>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis?
>>> thanks
>>> Nat
>>> 
>>> 
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor