[BioC] Remove batch effect in small RNASeq study (SVA or others?)
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Apr 29 08:28:36 CEST 2014
Shirley,
I specifically recommended to you a particular choice for prior.count,
which I thought would be appropriate for your purpose.
Gordon
On Mon, 28 Apr 2014, Ryan wrote:
> Dear Shirley,
>
> I have only used fold changes based on prior counts for descriptive purposes.
> My understanding is that edgeR uses the counts as is to do the statistics
> (i.e. p-values), but adds in the prior count when computing the logFC/logCPM
> values to avoid taking the log of zero and to avoid excessively large fold
> changes in low-abundance genes. However, I do not have a sense of what values
> are appropriate and I usually stick with the default.
>
> -Ryan
>
> On Mon Apr 28 19:16:41 2014, shirley zhang wrote:
>> Dear Ryan,
>>
>> I think you are right. As shown by ?cpm
>>
>> ## S3 method for class 'DGEList'
>> cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
>>
>> BTW, do you have experience about how to set "prior.count", e.g. 0.25
>> vs. 5?
>>
>> Many thanks,
>> Shirley
>>
>>
>>
>> On Mon, Apr 28, 2014 at 10:02 PM, Ryan <rct at thompsonclan.org
>> <mailto:rct at thompsonclan.org>> wrote:
>>
>> Hi Shirley,
>>
>> I beleive that "normalized.lib.sizes = TRUE" has become the
>> default when calling the cpm function on a DGEList, so you should
>> not need to specify it.
>>
>> -Ryan
>>
>>
>> On Mon Apr 28 18:59:02 2014, shirley zhang wrote:
>>
>> Dear Dr. Smyth,
>>
>> Thank you very much for taking your time to look through my
>> codes, and also
>> provided an more approciate code sequence. Thank you!
>>
>> By following your codes, I think the batch effect was removed
>> efficiently
>> as shown in the attached figures. Do you agree?
>>
>> Furthermore, I found a very useful paper you published in
>> Nature Protocol
>> 2013.
>> http://www.nature.com/nprot/__journal/v8/n9/full/nprot.2013.__099.html
>> <http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html>
>>
>> In the paper, you wrote:
>> d = DGEList(counts = count, group = samples$condition)
>> d = calcNormFactors(d)
>> nc = cpm(d, normalized.lib.sizes = TRUE)
>>
>> My question is:
>> In my case, should I add option "normalized.lib.sizes = TRUE"
>> in cpm()
>> after calling calcNormFactors() like the following:
>> dge <- DGEList(counts=count)
>> dge <- calcNormFactors(dge)
>> logCPM <- cpm(dge,log=TRUE,prior.count=__5,
>> normalized.lib.sizes = TRUE)
>>
>> Many thanks for your help,
>> Shirley
>>
>>
>>
>> On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth
>> <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>> wrote:
>>
>> To add colours to the MDS plot:
>>
>> plotMDS(logCPM, col=as.numeric(batch))
>>
>> Gordon
>>
>>
>> On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>>
>> Dear Shirley,
>>
>>
>> I don't think that scaling in prcomp() is appropriate
>> here.
>>
>> Better would be:
>>
>> library(edgeR)
>> dge <- DGEList(counts=count)
>> dge <- calcNormFactors(dge)
>> logCPM <- cpm(dge,log=TRUE,prior.count=__5)
>> plotMDS(logCPM)
>> logCPM <- removeBatchEffect(logCPM,__batch=batch)
>> plotMDS(logCPM)
>>
>> Best wishes
>> Gordon
>>
>>
>> On Mon, 28 Apr 2014, shirley zhang wrote:
>>
>> Dear Dr. Smyth,
>>
>>
>> Thanks for your reply. Here is the code sequence I
>> used to remove the
>> batch
>> effect and to make the PCA plot.
>>
>> First, getting raw counts for each feature per
>> sample using htseq-count
>> (~64K features by using Ensemble.gtf)
>> Then, getting a count data.matrix by combing all
>> samples together (64K
>> rows, and 14 columns). 8 samples from batch1, and
>> 6 samples from batch2.
>>
>> count = cbind(count.s1, count.s2, ...., count.s14)
>>
>>
>>
>> #remove the batch effect
>> library(edgeR)
>> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>>
>> logCPM <- cpm(count,log=TRUE,prior.__count=5)
>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>>
>> #PCA
>> B.res = prcomp(logCPM, scale = TRUE)
>> pc.s = summary(.Bres)$importance[2,1:__2]
>> pc1.var = round(pc.s[["PC1"]],2)
>> pc2.var = round(pc.s[["PC2"]],2)
>>
>> pdf(file = "all.count.logCPM.rmBatch.pca"__)
>> plot(B.res$rotation[,1:2], main = maintitle, xlab
>> = paste("PC1
>> (variance:
>> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2
>> (variance:
>> ",pc2.var*100,
>> "%)", sep =""))
>> dev.off()
>>
>> sessionInfo()
>>
>>
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils
>> datasets methods base
>>
>> other attached packages:
>> [1] edgeR_3.4.0 limma_3.18.7
>>
>>
>> Many thanks.
>> Shirley
>>
>>
>>
>>
>> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth
>> <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>>
>> wrote:
>>
>>
>> Date: Sun, 27 Apr 2014 20:32:55 -0400
>>
>> From: shirley zhang <shirley0818 at gmail.com
>> <mailto:shirley0818 at gmail.com>>
>> To: Gordon K Smyth <smyth at wehi.edu.au
>> <mailto:smyth at wehi.edu.au>>
>> Cc: Bioconductor mailing list
>> <bioconductor at r-project.org
>> <mailto:bioconductor at r-project.org>>
>> Subject: Re: [BioC] Remove batch effect in
>> small RNASeq study (SVA or
>> others?)
>>
>> Dear Dr. Smyth,
>>
>> Thank you very much for your quick reply.
>> I did as you suggested by
>> first
>> getting log CPM value, then call
>> removeBatchEffect(). I found the PCA
>> figure looks better than before, but there
>> is still a batch effect.
>>
>>
>> I don't see how there could still be a batch
>> effect.
>>
>> Please give the code sequence you used to
>> remove the batch effect and to
>> make the PCA plot.
>>
>> I attached two PCA figures. One is based on
>> log10(raw count) which is
>>
>> before calling cpm() and
>> removeBatchEffect(). Another one is after.
>>
>> Could you look at them and give me more
>> suggestions. Will a quantile
>> normalization across samples be a good
>> idea since CPM() is still a
>> normalization only within each sample??
>>
>>
>> You should normalize the data before using
>> removeBatchEffect(), and
>> quantile is one possibility.
>>
>> Gordon
>>
>> Thanks again for your help,
>>
>> Shirley
>>
>>
>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K
>> Smyth <smyth at wehi.edu.au
>> <mailto:smyth at wehi.edu.au>>
>> wrote:
>>
>> Dear Shirley,
>>
>>
>> I would probably do it like this:
>>
>> library(edgeR)
>> logCPM <- cpm(y,log=TRUE,prior.count=5)
>> logCPM <- removeBatchEffect(logCPM,
>> batch=batch)
>>
>> Best wishes
>> Gordon
>>
>> Date: Sat, 26 Apr 2014 10:51:23 -0400
>>
>> From: shirley zhang
>> <shirley0818 at gmail.com
>> <mailto:shirley0818 at gmail.com>>
>>
>> To: Bioconductor Mailing List
>> <bioconductor at stat.math.ethz.__ch
>> <mailto:bioconductor at stat.math.ethz.ch>>
>> Subject: [BioC] Remove batch
>> effect in small RNASeq study (SVA or
>> others?)
>>
>> I have a RNASeq paired-end data
>> from two batches (8 samples from
>> batch1,
>> and 7 samples from batch2). After
>> alignment using TopHat2, then I got
>> count
>> using HTseq-count, and FPKM value
>> via Cufflinks. A big batch effect
>> can
>> be
>> viewed in PCA using both log10(raw
>> count) and log10(FPKM) value.
>>
>> I can NOT use the block factor in
>> edgeR to remove batch effect since
>> I
>> need
>> to first obtain residuals after
>> adjusting for batch effect, then test
>> the
>> residuals for hundreds of
>> thousands of SNPs (eQTL analysis).
>>
>> My question is how to remove batch
>> effect without using edgeR:
>>
>> 1. is SVA ok for such a small
>> sample size (N=15)?
>> 2. If SVA does not work, any other
>> suggestions?
>>
>> Many thanks,
>> Shirley
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list