[BioC] Influence of expression correlation on false positive ratio
Wolfgang Huber
whuber at embl.de
Fri Jul 13 16:23:46 CEST 2012
Dear Kevin
I agree with your points. Two comments:
- You very appropriately point out, below and in your paper, that point
estimates of the FDR can be all over the place, particularly when there
is a lot of correlation. I have often seen that, too, and it should
probably be emphasized much more to statistics-naive users. But isn't
avoiding that exactly the point of methods like those of
Benjamini-Hochberg or Benjamini-Yekutieli, which control the FDR under
various types of dependence [1]?
- It is not quite as bad with multivariate methods, if you include (as I
do) clustering, heatmaps and classification. In microarray analysis,
these even predate the gene-by-gene testing. In particular heatmaps can
be surprisingly useful for detecting the major correlation structures.
Things get a bit more ambiguous and less automatable than with
gene-by-gene testing, but I don't think that's a reason not to do it.
Best wishes
Wolfgang
[1] The control of the false discovery rate in multiple testing under
dependency, Yoav Benjamini and Daniel Yekutieli, Ann. Statist. Volume
29, Number 4 (2001), 1165-1188.
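
A minimal sketch of the two step-up adjustments named above and treated
in [1], assuming the p-values are held in a plain Python list. The
function names (step_up_adjust, bh_adjust, by_adjust) are hypothetical
and chosen only for illustration. The results should correspond to what
R's p.adjust() returns for method "BH" and "BY", but treat that as an
assumption to check rather than a guarantee.

```python
def step_up_adjust(pvalues, factor):
    """Step-up adjusted p-values: min over ranks >= k of factor * m * p_(k) / k."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, factor * m * pvalues[i] / rank)
        adjusted[i] = running_min
    return adjusted


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjustment (no extra penalty factor)."""
    return step_up_adjust(pvalues, 1.0)


def by_adjust(pvalues):
    """Benjamini-Yekutieli adjustment: BH inflated by sum_{k=1..m} 1/k."""
    m = len(pvalues)
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    return step_up_adjust(pvalues, harmonic)


if __name__ == "__main__":
    # Made-up p-values, for illustration only.
    example = [0.01, 0.04, 0.03, 0.005]
    print("BH:", bh_adjust(example))
    print("BY:", by_adjust(example))
```

The only difference between the two is the harmonic-sum factor, which
is the price the Benjamini-Yekutieli procedure pays for making no
assumption about the form of the dependence between tests.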
Kevin R. Coombes scripsit 07/11/2012 06:22 PM:
> Hi Wolfgang,
>
> It's not just technical artifacts. Everyone believes (probably
> correctly) that gene expression in biological samples is in fact
> correlated, a fact that is exploited all the time when people run
> algorithms to try to (re)construct networks or pathways based on
> coexpression. And while I agree that a truly multivariate approach
> would be more advisable, (a) there is no consensus on how best to do
> this and (b) it is not the current standard practice. There are already
> gazillions of papers (and more are being written and published as I
> write this email) that compute p-values from univariate gene-by-gene
> tests and follow with a method to estimate the FDR.
>
> The operative word here is "estimate", which should make you think that
> there might be some uncertainty in the estimates. We recently did some
> simulations to get an idea of how much the precision of the FDR
> estimates is affected by correlation. We also point out a couple of
> examples from real data that suggest that the effect of correlation
> could be large. The paper has been accepted at BMC Bioinformatics, so I
> can supply the advance URL for people who want more information:
> http://www.biomedcentral.com/1471-2105/13/S13/S1/abstract
>
> Best,
> Kevin
>
> On 7/11/2012 7:22 AM, Wolfgang Huber wrote:
>> January,
>>
>> if you only require per-gene p-values and no multiple testing
>> adjustment, then the dependency is never a problem. The validity of
>> unadjusted per-gene p-values is unaffected by whether there is
>> dependency between the genes.
>>
>> For multiple testing, if you do FWER by the Westfall-Young method, any
>> dependence is also no problem. If you do FDR by the Benjamini-Hochberg
>> method, problems can in principle occur if there is pervasive
>> dependence. Often this is caused by technical artifacts, which would
>> be addressed (and removed) by the methods mentioned by Jeff. If it is
>> biological, then a serial univariate analysis (gene-by-gene testing)
>> does not seem the cleverest choice of approach, and a truly
>> multivariate approach seems more advisable.
>>
>> Best wishes
>> Wolfgang
>>
>>
>> Jeff Leek scripsit 07/09/2012 01:17 PM:
>>> Hi January,
>>>
>>> If the tests are only dependent in small groups, say because genes are
>>> grouped into small modules, then most FDR methods in the p.adjust()
>>> function or the methods in the qvalue package will work. The Bonferroni
>>> correction controls a more conservative error rate, but also holds under
>>> dependence.
>>>
>>> If the sources of dependence are more pervasive, like if there are batch
>>> effects:
>>>
>>> http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
>>>
>>> Then you can either use the batch correction methods in limma if,
>>> say, you know the date the samples were processed or, if you don't
>>> know the sources of large-scale dependence, use the sva package:
>>>
>>> http://www.bioconductor.org/packages/devel/bioc/html/sva.html
>>>
>>> which implements the methods described here:
>>>
>>> http://www.pnas.org/content/early/2008/11/24/0808709105.abstract
>>>
>>>
>>> Best,
>>>
>>>
>>> Jeff
>>>
>>>
>>>
>>> On Jul 9, 2012 7:08 AM, "January Weiner"
>>> <january.weiner at mpiib-berlin.mpg.de> wrote:
>>>
>>>> Hello,
>>>>
>>>> statistical methods for assessing significance of differences in
>>>> expression assume, correct me if I'm wrong, independence of the tests.
>>>> Does anyone have at hand any papers on the performance -- in terms of
>>>> type I error -- of methods such as limma / eBayes? I'm sure this issue
>>>> has been investigated in depth.
>>>>
>>>> Kind regards,
>>>>
>>>> January
>>>>
>>>> --
>>>> -------- Dr. January Weiner 3 --------------------------------------
>>>> Max Planck Institute for Infection Biology
>>>> Charitéplatz 1
>>>> D-10117 Berlin, Germany
>>>> Web : www.mpiib-berlin.mpg.de
>>>> Tel : +49-30-28460514
>>>> Fax : +49-30-28450505
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>>
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber