[BioC] RNA-seq differentially expressed gene finding methods

Sat Sep 6 21:39:12 CEST 2014

Son

of course you are right. Here’s an excerpt of our 2010 Genome Biology paper:

Conclusions
Why is it necessary to develop new statistical methodology for sequence count data? If large numbers of replicates were available, questions of data distribution could be avoided by using non-parametric methods, such as rank-based or permutation tests. However, it is desirable (and possible) to consider experiments with smaller numbers of replicates per condition. In order to compare an observed difference with an expected random variation, we can improve our picture of the latter in two ways: first, we can use distribution families, such as normal, Poisson and negative binomial distributions, in order to determine the higher moments, and hence the tail behavior, of statistics for differential expression, based on observed low order moments such as mean and variance. Second, we can share information, for instance, distributional parameters, between genes, based on the notion that data from different genes follow similar patterns of variability. Here, we have described an instance of such an approach, ...
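
To make the second point (sharing information between genes) concrete, here is a minimal Python sketch. It is not the procedure from the paper: it swaps in a much simpler shrinkage of per-gene variances toward the across-gene mean, in the style of a moderated variance. The function names, the prior degrees of freedom (`prior_df=4.0`) and the simulated data are all assumptions made for illustration.

```python
import math
import random
import statistics


def shrink_variances(sample_vars, df, prior_df):
    """Pull each per-gene variance toward the across-gene mean variance.

    df is the residual degrees of freedom of each per-gene estimate;
    prior_df (a hand-picked assumption here) sets how strongly we shrink.
    """
    prior_var = statistics.fmean(sample_vars)
    return [(prior_df * prior_var + df * s2) / (prior_df + df)
            for s2 in sample_vars]


def simulate(n_genes=2000, n_reps=3, prior_df=4.0, seed=1):
    """Compare raw and shrunken variance estimates on simulated genes."""
    rng = random.Random(seed)
    # Hypothetical genes whose true variances follow a similar pattern.
    true_vars = [math.exp(rng.gauss(0.0, 0.3)) for _ in range(n_genes)]
    sample_vars = [
        statistics.variance(
            [rng.gauss(0.0, math.sqrt(v)) for _ in range(n_reps)])
        for v in true_vars
    ]
    shrunk = shrink_variances(sample_vars, df=n_reps - 1, prior_df=prior_df)
    mse_raw = statistics.fmean(
        (s - t) ** 2 for s, t in zip(sample_vars, true_vars))
    mse_shrunk = statistics.fmean(
        (s - t) ** 2 for s, t in zip(shrunk, true_vars))
    return mse_raw, mse_shrunk


if __name__ == "__main__":
    mse_raw, mse_shrunk = simulate()
    print("MSE of per-gene variances:   ", round(mse_raw, 3))
    print("MSE after sharing information:", round(mse_shrunk, 3))
```

With only three replicates, each per-gene variance is very noisy on its own; borrowing strength from the other genes is what makes small experiments workable.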

Btw, the t-test can be perfectly “valid” even if the data are non-Normal, in particular when they have fatter tails. The test then just loses power, sometimes badly so.
I find it odd that so many people worry about that so much. Correlations between samples (e.g. ‘batch effects’) are much more problematic.
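
A rough way to check the validity claim is a small simulation; the sketch below is in Python with the standard library only. Everything in it is an assumption chosen for illustration: 10 samples per group, a pooled-variance two-sample t-test at the 5% level (critical value 2.101 for 18 degrees of freedom), and Student's t with 3 degrees of freedom as the fat-tailed distribution.

```python
import math
import random
import statistics

N_PER_GROUP = 10      # gives 2 * 10 - 2 = 18 degrees of freedom
T_CRIT_DF18 = 2.101   # two-sided 5% critical value of Student's t, 18 df


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return ((statistics.fmean(x) - statistics.fmean(y))
            / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny)))


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def heavy_tailed_draw(rng):
    # Student t with 3 df: standard normal / sqrt(chi-square_3 / 3).
    # A chi-square with 3 df is a gamma with shape 1.5 and scale 2.
    chi2 = rng.gammavariate(1.5, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 3.0)


def rejection_rate(draw, shift=0.0, n_sim=4000, seed=0):
    """Fraction of simulated experiments in which the t-test rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = [draw(rng) + shift for _ in range(N_PER_GROUP)]
        y = [draw(rng) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(x, y)) > T_CRIT_DF18:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    # shift=0.0 means no true difference, so this is the false positive rate.
    print("normal data:    ", rejection_rate(normal_draw))
    print("fat-tailed data:", rejection_rate(heavy_tailed_draw))
```

Under the null, both rejection rates should stay near or below the nominal 5%. Passing a non-zero `shift` turns the same function into a power estimate, which is where the fat tails hurt.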

Best wishes
Wolfgang

On 05 Sep 2014, at 19:31, Son Pham <spham at salk.edu> wrote:

> Thank you Richard, Devon and Paul for the very insightful answers.
> I completely agree that the approach I raised above is inappropriate when
> the group size is small (3, 4...).
> But when the group size is large enough ( > 20 or 30), the sampling
> distribution of the mean will be (close to) normally distributed, and that
> is why I believe that the t-test is ok.
> 
> 
> -Son.
> 
> 
> 
> 
> On Fri, Sep 5, 2014 at 10:05 AM, Paul Geeleher <paulgeeleher at gmail.com>
> wrote:
> 
>> Hi Son,
>> 
>> My understanding is that the approach you describe could be considered
>> valid for large enough numbers of samples. However, RNA-seq
>> experiments will typically have smaller numbers (<30) of samples per
>> condition, meaning that a t-test is not valid (because RNA-seq data
>> aren't normally distributed). And while I don't think that a
>> t-test is "invalid" given enough samples, it's very difficult to
>> justify using such a method when much better powered methods have been
>> invented specifically for this type of data.
>> 
>> Paul
>> 
>> On Fri, Sep 5, 2014 at 11:52 AM, Richard Friedman
>> <friedman at c2b2.columbia.edu> wrote:
>>> Dear Son,
>>> 
>>>        The t-test assumes a normal distribution,
>>> which is appropriate for continuous variables. RNA-seq
>>> data deal with counts (discrete entities). A negative binomial
>>> distribution (edgeR, DESeq) or a mean-dependent variance (voom)
>>> is much more appropriate. Also, the 3 methods mentioned
>>> above estimate variability better, using information from all genes
>>> via empirical Bayes methods, than does the
>>> one-gene-at-a-time frequentist t-test.
>>> 
>>> Best wishes,
>>> Rich
>>> Richard A. Friedman, PhD
>>> Associate Research Scientist,
>>> Biomedical Informatics Shared Resource
>>> Herbert Irving Comprehensive Cancer Center (HICCC)
>>> Lecturer,
>>> Department of Biomedical Informatics (DBMI)
>>> Educational Coordinator,
>>> Center for Computational Biology and Bioinformatics (C2B2)/
>>> National Center for Multiscale Analysis of Genomic Networks (MAGNet)/
>>> Columbia Department of Systems Biology
>>> Room 824
>>> Irving Cancer Research Center
>>> Columbia University
>>> 1130 St. Nicholas Ave
>>> New York, NY 10032
>>> (212)851-4765 (voice)
>>> friedman at c2b2.columbia.edu
>>> http://friedman.c2b2.columbia.edu/
>>> 
>>> "There is nothing in my Contemporary Jewish Literature course that is
>>> either contemporary, Jewish, or literature".
>>> 
>>> -Rose Friedman, age 17
>>> 
>>> 
>>> On Sep 5, 2014, at 12:44 PM, Son Pham wrote:
>>> 
>>>> Dear all,
>>>> I know that we have some very good packages (edgeR, DESeq) that calculate
>>>> the list of differentially expressed genes in 2 conditions (with
>>>> replicates) from raw counts. But I do not know what is wrong with the
>>>> following simple approach (and whether other people have been using it):
>>>> 
>>>> 1. Get the (estimated) TPM/FPKM for each gene in each sample
>>>> 2. Do a t-test between the two groups for each gene.
>>>> 3. Adjust the p-values for multiple testing (p-adj)
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> Son.
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>> 
>> 
>> 
>> --
>> Dr. Paul Geeleher, PhD
>> Section of Hematology-Oncology
>> Department of Medicine
>> The University of Chicago
>> 900 E. 57th St.,
>> KCBD, Room 7144
>> Chicago, IL 60637
>> --
>> www.bioinformaticstutorials.com
>> 
> 