[BioC] Getting counts for previously undetected transcripts and genes with easyRNASeq: comparison to Cufflinks

Tue Dec 4 11:27:51 CET 2012

Hi Richard,

On Dec 3, 2012, at 10:03 PM, Richard Friedman wrote:

> Dear Bioconductor List,
> 
> 	I am working through the  easyRNAseq use case to learn how 
> to obtain counts in RNASeq experiments for further analysis.
> The example starts with BAM files and converts BAM files to Exons,
> transcripts and genes using geneModels. 
> 
> Does using geneModels only assemble previously annotated transcripts and
> genes OR can it find new ones if present?

It can only assemble gene models from the exons defined in the annotation file you provide. 

> 
> If it can find new ones how well does it do this in comparison to Cufflinks?
> 

No it can't. There's a "now old" functionality - not exposed - that identify islands of coverage, i.e. in an approach similar to a very basic approach at finding large ChIP-Seq regions (e.g. from histone marks). But that would compare very poorly with cufflinks as it does not take into account any splice-junction nor paired-read information.

> If it cannot find new ones - is there a way to get counts (as distinct from fpkm
> values) for genes and transcripts from  cufflinks and 
> relate them to existing annotation where they correspond and present them
> as non-previosuly annotated ones when they don't correspond?
> 

Yes there is - I'm doing something very similar - but not in R. More below.

> Can easyRNASeq be used for this purpose?

Not directly no.

> Can anyone recommend a tool that can be used for this purpose?

More below.

> 
> My goal is to get a set of counts that can be input into CQN and then edgeR.
> I wish to use TopHat/Cufflinks to get the Exons, transcripts, and genes including
> novel spliced variants but I am persuaded CQN is a better way to normalize than
> FPKM and edgeR is a better way to analyze differential expression than 
> Cuffdiff. 
> 

I agree with you there. What you might want to consider in addition is to use an aligner that is multi-mapping aware. RSEM is an example, tophat/cufflinks does this in a similar way by default (or so I believe, double-check that this option is really enabled by default). Here it's useful to use bowtie2 rather than bowtie. Finally, I like the idea of using CQN, I'll consider adding this to the easyRNASeq "pipeline".

As I said, I have a very similar setup, but completely de-novo. I've been (still am) testing several approaches:

1) running TopHat/Cufflinks/Cuffmerge (cuffmerge to get the exon/gene GFF) and from that I go back to the original alignments by tophat and use these as input together with the GFF for easyRNASeq. I then get my DESeq/edgeR output and proceed in R.
2) In parallel, I'm doing the same using trinity (assembling the transcriptome de novo) and re-aligning the read libraries using RSEM. From the RSEM results, I get directly the count tables that I extract as a matrix and directly input into edgeR/DESeq (bypassing easyRNASeq). 
3) Same as 2, but I create a GFF annotation by performing a re-alignments of the obtained contig to the genome assembly using GMAP. Then using the RSEM bam files, I use easyRNASeq to create the count table.
I'm still evaluating the outcome of these different approaches.

As it seems that you already have some annotation for your genome/transcriptome, that should ease some of the steps, e.g. instead of running cuffmerge, you could run cuffcompare to refine your GFF (see my approach #1). Since this discussion is not really related to Bioconductor, we could continue it offline if you like.

HTH,

Cheers,

Nico

> I would appreciate any advice.
> 
> Thanks and best wishes,
> Rich
> Richard A. Friedman, PhD
> Associate Research Scientist,
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Lecturer,
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator,
> Center for Computational Biology and Bioinformatics (C2B2)/
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
> Columbia Initiative in Systems Biology
> Room 824
> Irving Cancer Research Center
> Columbia University
> 1130 St. Nicholas Ave
> New York, NY 10032
> (212)851-4765 (voice)
> friedman at cancercenter.columbia.edu
> http://cancercenter.columbia.edu/~friedman/
> 
> In memoriam, Ray Bradbury
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor