[BioC] Efficiently running DEXSeq for Large Cohorts

Alejandro Reyes alejandro.reyes at embl.de
Mon Jan 14 10:44:39 CET 2013


Sorry, there was an error in part of my code:

The "TRT" would be like this,

data(pasillaExons, package="pasilla")
pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT )
pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT )
pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT )

Bests,
Alejandro

> Dear Fong Chun Chan,
>
> Thank you for your interest in DEXSeq and sorry in advance for the 
> long e-mail. We have also noticed that the computing time increases 
> considerably when you have a large number of samples, conditions or 
> number of exons of a gene. For users in these situations, we have 
> implemented a variant of this functions (estimateDispersionsTRT and 
> testForDEUTRT) in the most recent versions of DEXSeq in the svn.
>
> The difference relies on how the model matrix is prepared, in the 
> "normal" functions, the model matrices used to fit the glms are 
> prepared for each exon, such that each exon bin is treated 
> individually, independently of which exon you are testing. For 
> example, if you have a gene with 5 exons, when testing for exon E001, 
> you would consider independently E002, E003, ... , E005 in the model.
>
> In the "TRT" implementation the same model matrix is used for all the 
> exons. In the same example as before, you would consider E001 and the 
> sum of all the rest exons of the same gene. This reduces the model and 
> allows to use DEXSeq with a large number of samples. For more clarity, 
> you could try to compare the normal model frame of a gene with the TRT 
> model frame:
>
> data(pasillaExons, package="pasilla")
> modelFrameForGene(pasillaExons, "FBgn0000256")
> # vs
> modelFrameForTRT( pasillaExons )
>
> Using the same example, in the last model frame, "this" would be the 
> "E001" and "others" would be the sum of E002 + E003 + ... + E005.
>
> This would be the "normal" DEXSeq analysis:
>
> pasillaExons <- estimateSizeFactors( pasillaExons )
> pasillaExons <- estimateDispersions( pasillaExons )
> pasillaExons <- fitDispersionFunction( pasillaExons )
> pasillaExons <- testForDEU( pasillaExons )
>
> This would be the "TRT",
>
> pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
> pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons )
> pasillaExonsTRT <- fitDispersionFunction( pasillaExons )
> pasillaExonsTRT <- testForDEUTRT( pasillaExons )
>
> And you can see that you get the same results:
>
> plot(fData(pasillaExons)$pvalue, fData(pasillaExonsTRT)$pvalue, log="xy")
>
> I have the "TRT" tried this for large cohorts with complex models and 
> it works nicely and in reasonable computing times.
>
> Best regards,
> Alejandro Reyes
>
> ps. this changes need to be added to the vignette.
>
>
>> Hi all,
>>
>> I've been trying to get DEXSeq to run on a fairly large RNA-seq 
>> cohort that
>> I have. To be specific, I have 89 samples and I am attempt to 
>> generate DE
>> exon usage results on > 500,000 exons.
>>
>> I've followed the latest tutorial (1.5.6) on Bioconductor and it so far
>> I've had relatively no problems. It just the two steps that are 
>> mentioned,
>> estimateDispersions and testForDEU, are taking a fairly long time. I've
>> already attempted to parallelize this on a 48-core 256GB machine, but 
>> I get
>> very little progress on the run-time of these functions.
>>
>> I was just wondering if anyone has a good way of running DEXSeq on 
>> such a
>> large cohort. Tips on how to reduce run time? Are there way to 
>> parallelize
>> these jobs across a cluster rather than rely on a single machine with
>> multi-cores? Any help would be greatly appreciated.
>>
>> Thanks,
>>
>> Fong
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list