[BioC] Efficiently running DEXSeq for Large Cohorts

Mon Jan 14 10:40:21 CET 2013

Dear Fong Chun Chan,

Thank you for your interest in DEXSeq, and sorry in advance for the long 
e-mail. We have also noticed that the computing time increases 
considerably when you have a large number of samples, conditions, or 
exons per gene. For users in these situations, we have implemented 
variants of the two slow functions (estimateDispersionsTRT and 
testForDEUTRT) in the most recent versions of DEXSeq in the svn.

The difference lies in how the model matrix is prepared. In the 
"normal" functions, the model matrices used to fit the GLMs are prepared 
so that each exon bin of the gene is treated individually, regardless 
of which exon you are testing. For example, if you have a gene with 5 
exons, then when testing exon E001 you would still model E002, E003, 
..., E005 as separate terms.

In the "TRT" implementation, the same model matrix is used for all the 
exons. In the same example as before, you would consider E001 and the 
sum of all the remaining exons of the same gene. This reduces the size 
of the model and makes it possible to use DEXSeq with a large number of 
samples. To make this clearer, you could compare the normal model frame 
of a gene with the TRT model frame:

library(DEXSeq)
data(pasillaExons, package="pasilla")
modelFrameForGene(pasillaExons, "FBgn0000256")
# vs
modelFrameForTRT( pasillaExons )

Using the same example, in the TRT model frame "this" would be E001 and 
"others" would be the sum of E002 + E003 + ... + E005.

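To make the "this" versus "others" idea concrete without needing DEXSeq at all, here is a minimal base-R sketch on made-up counts. The helper name collapse_for_exon, the toy numbers, and the model formula are assumptions chosen for illustration; this is not the code DEXSeq runs internally, only a picture of why the collapsed model is smaller.

```r
# Toy counts for one gene: 5 exon bins (rows) x 4 samples (columns).
# All numbers are invented for illustration.
counts <- matrix(
  c(10, 12, 30, 28,
    50, 55, 52, 49,
    20, 18, 22, 21,
     5,  7,  6,  4,
    40, 42, 39, 41),
  nrow = 5, byrow = TRUE,
  dimnames = list(sprintf("E%03d", 1:5), paste0("s", 1:4))
)
condition <- factor(c("untreated", "untreated", "treated", "treated"))

# "Normal"-style layout: one row per (sample, exon) pair,
# with every exon bin as its own factor level.
full_frame <- data.frame(
  sample    = rep(colnames(counts), each = nrow(counts)),
  exon      = factor(rep(rownames(counts), times = ncol(counts))),
  condition = rep(condition, each = nrow(counts)),
  count     = as.vector(counts)
)

# "TRT"-style layout for the exon under test (hypothetical helper):
# keep that exon as "this" and sum all other exons into "others".
collapse_for_exon <- function(counts, exon_id, condition) {
  this   <- counts[exon_id, ]
  others <- colSums(counts[rownames(counts) != exon_id, , drop = FALSE])
  data.frame(
    sample    = rep(colnames(counts), times = 2),
    exon      = factor(rep(c("this", "others"), each = ncol(counts)),
                       levels = c("others", "this")),
    condition = rep(condition, times = 2),
    count     = unname(c(this, others))
  )
}
trt_frame <- collapse_for_exon(counts, "E001", condition)

# Compare the number of model-matrix columns under an illustrative
# formula (assumed here, not taken from the DEXSeq source).
f <- ~ sample + exon + condition:exon
n_full <- ncol(model.matrix(f, data = full_frame))
n_trt  <- ncol(model.matrix(f, data = trt_frame))
print(c(full = n_full, trt = n_trt))
```

In the full layout the number of columns grows with the number of exon bins in the gene, whereas in the collapsed layout the exon factor always has exactly two levels, so the model size depends only on the number of samples and conditions.
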
This would be the "normal" DEXSeq analysis:

pasillaExons <- estimateSizeFactors( pasillaExons )
pasillaExons <- estimateDispersions( pasillaExons )
pasillaExons <- fitDispersionFunction( pasillaExons )
pasillaExons <- testForDEU( pasillaExons )

And this would be the "TRT" analysis (note that each step takes the 
object returned by the previous one):

pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT )
pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT )
pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT )

And you can see that you get the same results:

plot(fData(pasillaExons)$pvalue, fData(pasillaExonsTRT)$pvalue, log="xy")

I have tried the "TRT" functions on large cohorts with complex models, 
and they work nicely and in reasonable computing times.

Best regards,
Alejandro Reyes

PS: These changes still need to be added to the vignette.

> Hi all,
>
> I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort that
> I have. To be specific, I have 89 samples and I am attempting to generate DE
> exon usage results on > 500,000 exons.
>
> I've followed the latest tutorial (1.5.6) on Bioconductor and so far
> I've had virtually no problems. It's just that the two steps mentioned,
> estimateDispersions and testForDEU, are taking a fairly long time. I've
> already attempted to parallelize this on a 48-core 256GB machine, but I see
> very little improvement in the run-time of these functions.
>
> I was just wondering if anyone has a good way of running DEXSeq on such a
> large cohort. Any tips on how to reduce run time? Is there a way to parallelize
> these jobs across a cluster rather than relying on a single multi-core
> machine? Any help would be greatly appreciated.
>
> Thanks,
>
> Fong