[BioC] Influence of expression correlation on false positive ratio

Fri Jul 13 16:23:46 CEST 2012

Dear Kevin

I agree with your points. Two comments:

- You very appropriately point out below and in your paper that point 
estimates of the FDR can be all over the place, particularly when there 
is a lot of correlation. I have often seen that, too, and it should 
probably be emphasized much more to statistically naive users. But 
isn't avoiding that exactly the point of methods like those of 
Benjamini-Hochberg or Benjamini-Yekutieli, which control the FDR under 
various types of dependence [1]?

- It is not quite as bad with multivariate methods, if you include (as I 
do) clustering, heatmaps and classification. In microarray analysis, 
these even predate gene-by-gene testing. Heatmaps in particular can 
be surprisingly useful for detecting the major correlation structures. 
Things get a bit more ambiguous and less automatable than with 
gene-by-gene testing, but I don't think that's a reason not to use them.
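
To make the first point concrete, here is a minimal sketch of the two
adjustments in plain Python (in R, p.adjust(p, method = "BH") and
p.adjust(p, method = "BY") do the same job). The function names
bh_adjust and by_adjust and the example p-values are made up for
illustration and do not come from any package. As far as I understand
[1], the BH bound is guaranteed under independence and under positive
regression dependence, while the BY version pays a factor of
1 + 1/2 + ... + 1/m to hold under arbitrary dependence.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    m = len(pvalues)
    # Walk from the largest p-value to the smallest (step-up procedure).
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def by_adjust(pvalues):
    """Benjamini-Yekutieli adjusted p-values (illustrative helper)."""
    m = len(pvalues)
    # Harmonic-sum penalty that buys validity under arbitrary dependence.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    return [min(1.0, q * c_m) for q in bh_adjust(pvalues)]


# Hypothetical p-values, for illustration only.
example_p = [0.005, 0.5, 0.01, 0.04]
print("BH:", bh_adjust(example_p))
print("BY:", by_adjust(example_p))
```

The BY values are never smaller than the BH values, which is the price
of the weaker assumption about dependence.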

	Best wishes
	Wolfgang

[1] The control of the false discovery rate in multiple testing under 
dependency, Yoav Benjamini and Daniel Yekutieli, Ann. Statist. Volume 
29, Number 4 (2001), 1165-1188.

Kevin R. Coombes scripsit 07/11/2012 06:22 PM:
> Hi Wolfgang,
>
> It's not just technical artifacts.  Everyone believes (probably
> correctly) that gene expression in biological samples is in fact
> correlated, a fact that is exploited all the time when people run
> algorithms to try to (re)construct networks or pathways based on
> coexpression.  And while I agree that a truly multivariate approach
> would be more advisable, (a) there is no consensus on how best to do
> this and (b) it is not the current standard practice.  There are already
> gazillions of papers (and more are being written and published as I
> write this email) that compute p-values from univariate gene-by-gene
> tests and follow with a method to estimate the FDR.
>
> The operative word here is "estimate", which should make you think that
> there might be some uncertainty in the estimates.  We recently did some
> simulations to get an idea of how much the precision of the FDR
> estimates is affected by correlation.  We also point out a couple of
> examples from real data that suggest that the effect of correlation
> could be large.  The paper has been accepted at BMC Bioinformatics, so I
> can supply the advance URL for people who want more information:
> http://www.biomedcentral.com/1471-2105/13/S13/S1/abstract
>
> Best,
>      Kevin
>
> On 7/11/2012 7:22 AM, Wolfgang Huber wrote:
>> January,
>>
>> if you only require per-gene p-values and no multiple testing
>> adjustment, then the dependency is never a problem. The validity of
>> unadjusted per-gene p-values is unaffected by whether there is
>> dependency between the genes.
>>
>> For multiple testing, if you do FWER by the Westfall-Young method, any
>> dependence is also no problem. If you do FDR by the Benjamini-Hochberg
>> method, problems can in principle occur if there is pervasive
>> dependence. Often this is caused by technical artifacts, which would
>> be addressed (and removed) by the methods mentioned by Jeff. If it is
>> biological, then a serial univariate analysis (gene-by-gene testing)
>> does not seem the cleverest choice of approach, and a truly
>> multivariate approach seems more advisable.
>>
>>     Best wishes
>>     Wolfgang
>>
>>
>> Jeff Leek scripsit 07/09/2012 01:17 PM:
>>> Hi January,
>>>
>>> If the tests are only dependent in small groups, say because genes are
>>> grouped into small modules,  then most FDR methods in the p.adjust()
>>> function or the methods in the qvalue package will work. The Bonferroni
>>> correction controls a more conservative error rate, but also holds under
>>> dependence.
>>>
>>> If the sources of dependence are more pervasive, like if there are batch
>>> effects:
>>>
>>> http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
>>>
>>> Then you can use the batch correction methods in limma if, say, you
>>> know the date the samples were processed. Or, if you don't know the
>>> sources of large-scale dependence, you can use the sva package:
>>>
>>> http://www.bioconductor.org/packages/devel/bioc/html/sva.html
>>>
>>> which implements the methods described here:
>>>
>>> http://www.pnas.org/content/early/2008/11/24/0808709105.abstract
>>>
>>>
>>> Best,
>>>
>>>
>>> Jeff
>>>
>>>
>>>
>>> On Jul 9, 2012 7:08 AM, "January Weiner"
>>> <january.weiner at mpiib-berlin.mpg.de>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> statistical methods for assessing significance of differences in
>>>> expression assume, correct me if I'm wrong, independence of the tests.
>>>> Does anyone have at hand any papers on the performance -- in terms of
>>>> type I error -- of methods such as limma / eBayes? I'm sure this issue
>>>> has been investigated in depth.
>>>>
>>>> Kind regards,
>>>>
>>>> January
>>>>
>>>> --
>>>> -------- Dr. January Weiner 3 --------------------------------------
>>>> Max Planck Institute for Infection Biology
>>>> Charitéplatz 1
>>>> D-10117 Berlin, Germany
>>>> Web   : www.mpiib-berlin.mpg.de
>>>> Tel     : +49-30-28460514
>>>> Fax    : +49-30-28450505
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>>

-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber