[BioC] Removing batch effects with sva and combat using the sva package

Andrew Teschendorff a.teschendorff at ucl.ac.uk
Thu Dec 15 19:00:43 CET 2011


 Dear Bioconductors,

 In addition to Jeff's detailed explanation, I think it is also important to make the community aware
 of some of the potential pitfalls associated with SVA methodologies. In particular, SVA assumes a model (e.g.
 a linear model) relating the phenotype of interest to the actual data. If this model is not an accurate reflection
 of the true (unknown) model, then one can face a scenario where biologically interesting variation (i.e.
 variation associated with your phenotype of interest) is still present in the surrogate variable subspace, so that subsequent
 adjustment for these surrogate variables can result in an unreasonably weak biological signal. Examples that
 demonstrate this "breakdown" scenario are described in

 http://www.ncbi.nlm.nih.gov/pubmed/21471010
 http://bioinformatics.oxfordjournals.org/content/27/11/1496.long

 So it is important to check a posteriori that the inferred surrogate variables do not correlate strongly with your phenotype of interest.
 If they do, it may be dangerous to include them in your subsequent supervised regression analysis. Incorporating a surrogate
 variable selection step may therefore be necessary. How to perform this selection step in the case where the confounders
 are known is described in the above paper.
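
 The a posteriori check can be sketched as follows. This is a minimal illustration in plain Python and is not part of
 the sva package: the function names, the toy vectors and the 0.5 cutoff are all assumptions chosen for the example.
 In a real analysis the surrogate variables would come from the sva output, and the cutoff (or a formal test) would
 have to suit the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_surrogates(surrogates, phenotype, cutoff=0.5):
    """Return the indices of surrogate variables whose |r| with the phenotype exceeds cutoff.

    The cutoff of 0.5 is an arbitrary illustrative value, not a recommendation.
    """
    return [
        i
        for i, sv in enumerate(surrogates)
        if abs(pearson(sv, phenotype)) > cutoff
    ]


# Toy data (hypothetical): 8 samples with a binary phenotype.
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]
surrogates = [
    [0.9, 1.1, 1.0, 0.8, -1.0, -0.9, -1.1, -0.8],  # tracks the phenotype
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # unrelated to the phenotype
]

flagged = flag_surrogates(surrogates, phenotype)
print(flagged)
```

 A surrogate variable flagged in this way is a candidate for exclusion, or at least for closer inspection, before it
 is used as a covariate.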

 kind regards
 A.

***********************************************************************************************************************************************
Andrew E Teschendorff   PhD
Heller Research Fellow
Statistical Cancer Genomics
Paul O'Gorman Building
UCL Cancer Institute
University College London
72 Huntley Street
London WC1E 6BT, UK.

Mob: +44 07876 561263
Email: a.teschendorff at ucl.ac.uk
http://www.ucl.ac.uk/cancer/research-groups/statistical_cancer_genomics/index.htm
********************************************************************************************************************************************
________________________________________
From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of Jeff Leek [jtleek at gmail.com]
Sent: 15 December 2011 17:16
To: bioconductor at r-project.org
Subject: [BioC] Removing batch effects with sva and combat using the sva package

The latest version of the sva package is now available at Bioconductor:

http://bioconductor.org/packages/release/bioc/html/sva.html

This version includes support for both surrogate variable analysis, as
described in the papers:

http://www.biostat.jhsph.edu/~jleek/papers/sva.pdf
http://www.biostat.jhsph.edu/~jleek/papers/framework.pdf

and for ComBat, an approach for removing batch effects when the source of
the batch effect is known, as described in the paper:

http://biostatistics.oxfordjournals.org/content/early/2006/04/21/biostatistics.kxj037.full.pdf

A full description of how to use the methods, including how to use the sva
package with limma, removing batch effects with linear models, biological
versus technical batch effects, direct adjustment versus surrogate variable
adjustment, and batch effects for prediction is available here:

http://bioconductor.org/packages/2.9/bioc/vignettes/sva/inst/doc/sva.pdf

Several recent questions have focused on removing batch effects from gene
expression or other high-throughput data as a cleaning step prior to
performing other analyses. An important point about batch effect correction
(whether with sva, ComBat, or any other currently published approach) is
that a regression analysis is performed and variation is removed from the
data. Subsequent analyses using a "cleaned" version of the data should
therefore be performed with caution. In particular, methods used to infer
networks or to illustrate patterns (MDS/PCA) should be used with caution
after regressing out batch effects. All currently published batch effect
removal methods focus on adjusting batch effects for differential expression.
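
To make the "regression removes variation" point concrete, here is a minimal
sketch in plain Python. It is not the sva or ComBat implementation: the
function name and the toy values are assumptions for illustration, and the
only covariate is a known batch label. In that special case the least-squares
fit for a feature is simply its per-batch mean.

```python
def remove_batch_means(values, batch):
    """Regress one feature on batch labels and return the adjusted values.

    With batch as the only covariate, the fitted value for a sample is the
    mean of its batch, so the adjustment subtracts each batch mean and adds
    back the overall mean.
    """
    overall = sum(values) / len(values)
    sums, counts = {}, {}
    for v, b in zip(values, batch):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [v - means[b] + overall for v, b in zip(values, batch)]


# Toy data (hypothetical): one gene measured in two batches of three samples.
batch = ["A", "A", "A", "B", "B", "B"]
gene = [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]

cleaned = remove_batch_means(gene, batch)
print(cleaned)
```

After the adjustment both batches share the same mean, so any difference
between them is gone. If the phenotype were aligned with batch (for example,
all cases in batch B), this step would remove the phenotype difference as
well, which is one reason to treat "cleaned" data with caution.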

That being said, the sva package can be used to "clean" a data set as
follows: (1) Use the sva() function as described in the vignette to run sva
and store the sva object. (2) Pass the data set to the fsva() function,
along with the model matrix used to define the sva object and the
surrogate variable object. The db variable that is returned from this
command will be a "clean" version of the original data set.
