[BioC] Remove batch effect in small RNASeq study (SVA or others?)
Ryan
rct at thompsonclan.org
Tue Apr 29 08:18:54 CEST 2014
Dear Shirley,
I have only used fold changes based on prior counts for descriptive
purposes. My understanding is that edgeR uses the counts as is to do
the statistics (i.e. p-values), but adds in the prior count when
computing the logFC/logCPM values to avoid taking the log of zero and
to avoid excessively large fold changes in low-abundance genes.
However, I do not have a sense of what values are appropriate and I
usually stick with the default.
-Ryan
On Mon Apr 28 19:16:41 2014, shirley zhang wrote:
> Dear Ryan,
>
> I think you are right. As shown by ?cpm
>
> ## S3 method for class 'DGEList'
> cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
>
> BTW, do you have experience about how to set "prior.count", e.g. 0.25
> vs. 5?
>
> Many thanks,
> Shirley
>
>
>
> On Mon, Apr 28, 2014 at 10:02 PM, Ryan <rct at thompsonclan.org
> <mailto:rct at thompsonclan.org>> wrote:
>
> Hi Shirley,
>
> I beleive that "normalized.lib.sizes = TRUE" has become the
> default when calling the cpm function on a DGEList, so you should
> not need to specify it.
>
> -Ryan
>
>
> On Mon Apr 28 18:59:02 2014, shirley zhang wrote:
>
> Dear Dr. Smyth,
>
> Thank you very much for taking your time to look through my
> codes, and also
> provided an more approciate code sequence. Thank you!
>
> By following your codes, I think the batch effect was removed
> efficiently
> as shown in the attached figures. Do you agree?
>
> Furthermore, I found a very useful paper you published in
> Nature Protocol
> 2013.
> http://www.nature.com/nprot/__journal/v8/n9/full/nprot.2013.__099.html
> <http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html>
>
> In the paper, you wrote:
> d = DGEList(counts = count, group = samples$condition)
> d = calcNormFactors(d)
> nc = cpm(d, normalized.lib.sizes = TRUE)
>
> My question is:
> In my case, should I add option "normalized.lib.sizes = TRUE"
> in cpm()
> after calling calcNormFactors() like the following:
> dge <- DGEList(counts=count)
> dge <- calcNormFactors(dge)
> logCPM <- cpm(dge,log=TRUE,prior.count=__5,
> normalized.lib.sizes = TRUE)
>
> Many thanks for your help,
> Shirley
>
>
>
> On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth
> <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>> wrote:
>
> To add colours to the MDS plot:
>
> plotMDS(logCPM, col=as.numeric(batch))
>
> Gordon
>
>
> On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>
> Dear Shirley,
>
>
> I don't think that scaling in prcomp() is appropriate
> here.
>
> Better would be:
>
> library(edgeR)
> dge <- DGEList(counts=count)
> dge <- calcNormFactors(dge)
> logCPM <- cpm(dge,log=TRUE,prior.count=__5)
> plotMDS(logCPM)
> logCPM <- removeBatchEffect(logCPM,__batch=batch)
> plotMDS(logCPM)
>
> Best wishes
> Gordon
>
>
> On Mon, 28 Apr 2014, shirley zhang wrote:
>
> Dear Dr. Smyth,
>
>
> Thanks for your reply. Here is the code sequence I
> used to remove the
> batch
> effect and to make the PCA plot.
>
> First, getting raw counts for each feature per
> sample using htseq-count
> (~64K features by using Ensemble.gtf)
> Then, getting a count data.matrix by combing all
> samples together (64K
> rows, and 14 columns). 8 samples from batch1, and
> 6 samples from batch2.
>
> count = cbind(count.s1, count.s2, ...., count.s14)
>
>
>
> #remove the batch effect
> library(edgeR)
> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>
> logCPM <- cpm(count,log=TRUE,prior.__count=5)
> logCPM <- removeBatchEffect(logCPM, batch=batch)
>
> #PCA
> B.res = prcomp(logCPM, scale = TRUE)
> pc.s = summary(.Bres)$importance[2,1:__2]
> pc1.var = round(pc.s[["PC1"]],2)
> pc2.var = round(pc.s[["PC2"]],2)
>
> pdf(file = "all.count.logCPM.rmBatch.pca"__)
> plot(B.res$rotation[,1:2], main = maintitle, xlab
> = paste("PC1
> (variance:
> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2
> (variance:
> ",pc2.var*100,
> "%)", sep =""))
> dev.off()
>
> sessionInfo()
>
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils
> datasets methods base
>
> other attached packages:
> [1] edgeR_3.4.0 limma_3.18.7
>
>
> Many thanks.
> Shirley
>
>
>
>
> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth
> <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>>
> wrote:
>
>
> Date: Sun, 27 Apr 2014 20:32:55 -0400
>
> From: shirley zhang <shirley0818 at gmail.com
> <mailto:shirley0818 at gmail.com>>
> To: Gordon K Smyth <smyth at wehi.edu.au
> <mailto:smyth at wehi.edu.au>>
> Cc: Bioconductor mailing list
> <bioconductor at r-project.org
> <mailto:bioconductor at r-project.org>>
> Subject: Re: [BioC] Remove batch effect in
> small RNASeq study (SVA or
> others?)
>
> Dear Dr. Smyth,
>
> Thank you very much for your quick reply.
> I did as you suggested by
> first
> getting log CPM value, then call
> removeBatchEffect(). I found the PCA
> figure looks better than before, but there
> is still a batch effect.
>
>
> I don't see how there could still be a batch
> effect.
>
> Please give the code sequence you used to
> remove the batch effect and to
> make the PCA plot.
>
> I attached two PCA figures. One is based on
> log10(raw count) which is
>
> before calling cpm() and
> removeBatchEffect(). Another one is after.
>
> Could you look at them and give me more
> suggestions. Will a quantile
> normalization across samples be a good
> idea since CPM() is still a
> normalization only within each sample??
>
>
> You should normalize the data before using
> removeBatchEffect(), and
> quantile is one possibility.
>
> Gordon
>
> Thanks again for your help,
>
> Shirley
>
>
> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K
> Smyth <smyth at wehi.edu.au
> <mailto:smyth at wehi.edu.au>>
> wrote:
>
> Dear Shirley,
>
>
> I would probably do it like this:
>
> library(edgeR)
> logCPM <- cpm(y,log=TRUE,prior.count=5)
> logCPM <- removeBatchEffect(logCPM,
> batch=batch)
>
> Best wishes
> Gordon
>
> Date: Sat, 26 Apr 2014 10:51:23 -0400
>
> From: shirley zhang
> <shirley0818 at gmail.com
> <mailto:shirley0818 at gmail.com>>
>
> To: Bioconductor Mailing List
> <bioconductor at stat.math.ethz.__ch
> <mailto:bioconductor at stat.math.ethz.ch>>
> Subject: [BioC] Remove batch
> effect in small RNASeq study (SVA or
> others?)
>
> I have a RNASeq paired-end data
> from two batches (8 samples from
> batch1,
> and 7 samples from batch2). After
> alignment using TopHat2, then I got
> count
> using HTseq-count, and FPKM value
> via Cufflinks. A big batch effect
> can
> be
> viewed in PCA using both log10(raw
> count) and log10(FPKM) value.
>
> I can NOT use the block factor in
> edgeR to remove batch effect since
> I
> need
> to first obtain residuals after
> adjusting for batch effect, then test
> the
> residuals for hundreds of
> thousands of SNPs (eQTL analysis).
>
> My question is how to remove batch
> effect without using edgeR:
>
> 1. is SVA ok for such a small
> sample size (N=15)?
> 2. If SVA does not work, any other
> suggestions?
>
> Many thanks,
> Shirley
>
>
>
> ______________________________________________________________
>
> __________
> The information in this email is confidential
> and intended solely for
> the
> addressee.
> You must not disclose, forward, print or use
> it without the permission
> of
> the sender.
> __________________________________________________________________________
>
>
>
>
> __________________________________________________________________________
> The information in this email is confidential and intended
> solely for the
> addressee.
> You must not disclose, forward, print or use it without
> the permission of
> the sender.
> __________________________________________________________________________
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
More information about the Bioconductor
mailing list