[BioC] Efficiently running DEXSeq for Large Cohorts
Alejandro Reyes
alejandro.reyes at embl.de
Mon Jan 14 10:40:21 CET 2013
Dear Fong Chun Chan,
Thank you for your interest in DEXSeq, and sorry in advance for the long
e-mail. We have also noticed that the computing time increases
considerably when you have a large number of samples, conditions, or
exons per gene. For users in these situations, we have implemented
variants of estimateDispersions and testForDEU (estimateDispersionsTRT
and testForDEUTRT) in the most recent versions of DEXSeq in the svn.
The difference lies in how the model matrix is prepared. In the
"normal" functions, the model matrices used to fit the GLMs are prepared
for each exon, such that each exon bin is treated individually,
independently of which exon you are testing. For example, if you have a
gene with 5 exons, then when testing exon E001 the model would also
consider E002, E003, ... , E005 individually.
In the "TRT" implementation the same model matrix is used for all the
exons. In the same example as before, you would consider E001 and the
sum of all the remaining exons of the same gene. This reduces the size
of the model and makes it feasible to use DEXSeq with a large number of
samples. To make this clearer, you could compare the normal model frame
of a gene with the TRT model frame:
data(pasillaExons, package="pasilla")
modelFrameForGene(pasillaExons, "FBgn0000256")
# vs
modelFrameForTRT( pasillaExons )
Using the same example, in the second (TRT) model frame, "this" would be
"E001" and "others" would be the sum of E002 + E003 + ... + E005.
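To illustrate the idea without DEXSeq, here is a minimal base R sketch on
made-up counts. It is not DEXSeq's internal code: the sample names, the
Poisson toy counts, and the two formulas are assumptions chosen only to show
how collapsing the other exons into "others" shrinks the model frame and
the design matrix.

```r
# Toy illustration in base R; NOT DEXSeq's internal code.
# All names and counts below are hypothetical.
samples   <- paste0("S", 1:6)
exons     <- sprintf("E%03d", 1:5)
condition <- setNames(rep(c("treated", "untreated"), each = 3), samples)

# "Normal" layout: one row per sample and exon bin.
normal <- expand.grid(exon = exons, sample = samples)
normal$condition <- factor(condition[as.character(normal$sample)])
set.seed(1)
normal$count <- rpois(nrow(normal), lambda = 50)

# TRT-style layout when testing E001: keep E001 as "this" and
# sum the counts of E002..E005 into "others", per sample.
normal$bin <- factor(ifelse(normal$exon == "E001", "this", "others"),
                     levels = c("this", "others"))
trt <- aggregate(count ~ sample + condition + bin, data = normal, FUN = sum)

# Assumed formulas, for illustration only: sample effect, exon (or bin)
# effect, and a condition-by-exon (or condition-by-bin) interaction.
mm_normal <- model.matrix(~ sample + exon + condition:exon, data = normal)
mm_trt    <- model.matrix(~ sample + bin + condition:bin, data = trt)

# Compare the sizes of the two layouts.
print(c(rows_normal = nrow(normal), rows_trt = nrow(trt)))
print(c(cols_normal = ncol(mm_normal), cols_trt = ncol(mm_trt)))
```

In this sketch the "normal" layout grows with the number of exons in the
gene, while the collapsed layout always has two rows per sample, however
many exons the gene has.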
This would be the "normal" DEXSeq analysis:
pasillaExons <- estimateSizeFactors( pasillaExons )
pasillaExons <- estimateDispersions( pasillaExons )
pasillaExons <- fitDispersionFunction( pasillaExons )
pasillaExons <- testForDEU( pasillaExons )
This would be the "TRT" analysis:
pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT )
pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT )
pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT )
And you can see that you get the same results:
plot(fData(pasillaExons)$pvalue, fData(pasillaExonsTRT)$pvalue, log="xy")
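If you prefer a numerical check in addition to the plot, something along
these lines could be used. This is only a sketch: compare_pvalues is a
hypothetical helper, not part of DEXSeq, and the two p-value vectors are
simulated placeholders standing in for fData(pasillaExons)$pvalue and
fData(pasillaExonsTRT)$pvalue. It drops NA and zero p-values, which cannot
be shown on log axes anyway.

```r
# Hypothetical helper (not a DEXSeq function): summarise the agreement
# between two p-value vectors of equal length.
compare_pvalues <- function(p_a, p_b) {
  stopifnot(length(p_a) == length(p_b))
  # Keep only pairs that are finite and strictly positive in both vectors.
  ok <- is.finite(p_a) & is.finite(p_b) & p_a > 0 & p_b > 0
  list(
    n_compared         = sum(ok),
    spearman           = cor(p_a[ok], p_b[ok], method = "spearman"),
    max_abs_log10_diff = max(abs(log10(p_a[ok]) - log10(p_b[ok])))
  )
}

# Simulated placeholder data; replace with the real p-value columns.
set.seed(42)
p_normal <- runif(1000)
p_normal[c(3, 17)] <- NA                       # untestable exons give NA
p_trt <- pmin(1, p_normal * exp(rnorm(1000, sd = 0.05)))

res <- compare_pvalues(p_normal, p_trt)
print(res)
```

A rank correlation close to 1 would indicate that both variants order the
exons in nearly the same way.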
I have tried the "TRT" variant on large cohorts with complex models, and
it works nicely and in reasonable computing times.
Best regards,
Alejandro Reyes
PS: these changes still need to be added to the vignette.
> Hi all,
>
> I've been trying to get DEXSeq to run on a fairly large RNA-seq cohort that
> I have. To be specific, I have 89 samples and I am attempting to generate DE
> exon usage results on > 500,000 exons.
>
> I've followed the latest tutorial (1.5.6) on Bioconductor and so far
> I've had relatively few problems. It's just that the two steps mentioned,
> estimateDispersions and testForDEU, are taking a fairly long time. I've
> already attempted to parallelize this on a 48-core 256GB machine, but I see
> very little improvement in the run time of these functions.
>
> I was just wondering if anyone has a good way of running DEXSeq on such a
> large cohort. Tips on how to reduce run time? Is there a way to parallelize
> these jobs across a cluster rather than rely on a single multi-core
> machine? Any help would be greatly appreciated.
>
> Thanks,
>
> Fong