[BioC] Remove batch effect in small RNASeq study (SVA or others?)

Tue Apr 29 08:28:36 CEST 2014

Shirley,

I specifically recommended to you a particular choice for prior.count, 
which I thought would be appropriate for your purpose.

Gordon

On Mon, 28 Apr 2014, Ryan wrote:

> Dear Shirley,
>
> I have only used fold changes based on prior counts for descriptive purposes. 
> My understanding is that edgeR uses the counts as is to do the statistics 
> (i.e. p-values), but adds in the prior count when computing the logFC/logCPM 
> values to avoid taking the log of zero and to avoid excessively large fold 
> changes in low-abundance genes. However, I do not have a sense of what values 
> are appropriate and I usually stick with the default.
>
> -Ryan
>
> On Mon Apr 28 19:16:41 2014, shirley zhang wrote:
>> Dear Ryan,
>> 
>> I think you are right. As shown by ?cpm
>>
>>  ## S3 method for class 'DGEList'
>>      cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
>> 
>> BTW, do you have experience about how to set "prior.count", e.g. 0.25
>> vs. 5?
>> 
>> Many thanks,
>> Shirley
>> 
>> 
>> 
>> On Mon, Apr 28, 2014 at 10:02 PM, Ryan <rct at thompsonclan.org
>> <mailto:rct at thompsonclan.org>> wrote:
>>
>>     Hi Shirley,
>>
>>     I beleive that "normalized.lib.sizes = TRUE" has become the
>>     default when calling the cpm function on a DGEList, so you should
>>     not need to specify it.
>>
>>     -Ryan
>> 
>>
>>     On Mon Apr 28 18:59:02 2014, shirley zhang wrote:
>>
>>           Dear Dr. Smyth,
>>
>>         Thank you very much for taking your time to look through my
>>         codes, and also
>>         provided an more approciate code sequence. Thank you!
>>
>>         By following your codes, I think the batch effect was removed
>>         efficiently
>>         as shown in the attached figures. Do you agree?
>>
>>         Furthermore, I found a very useful paper you published in
>>         Nature Protocol
>>         2013.
>>         http://www.nature.com/nprot/__journal/v8/n9/full/nprot.2013.__099.html
>>         <http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html>
>>
>>         In the paper, you wrote:
>>         d = DGEList(counts = count, group = samples$condition)
>>         d = calcNormFactors(d)
>>         nc = cpm(d, normalized.lib.sizes = TRUE)
>>
>>         My question is:
>>         In my case, should I add option "normalized.lib.sizes = TRUE"
>>         in cpm()
>>         after calling calcNormFactors() like the following:
>>         dge <- DGEList(counts=count)
>>         dge <- calcNormFactors(dge)
>>         logCPM <- cpm(dge,log=TRUE,prior.count=__5,
>>         normalized.lib.sizes = TRUE)
>>
>>         Many thanks for your help,
>>         Shirley
>> 
>> 
>>
>>         On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth
>>         <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>> wrote:
>>
>>             To add colours to the MDS plot:
>>
>>               plotMDS(logCPM, col=as.numeric(batch))
>>
>>             Gordon
>> 
>>
>>             On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>>
>>               Dear Shirley,
>> 
>>
>>                 I don't think that scaling in prcomp() is appropriate
>>                 here.
>>
>>                 Better would be:
>>
>>                 library(edgeR)
>>                 dge <- DGEList(counts=count)
>>                 dge <- calcNormFactors(dge)
>>                 logCPM <- cpm(dge,log=TRUE,prior.count=__5)
>>                 plotMDS(logCPM)
>>                 logCPM <- removeBatchEffect(logCPM,__batch=batch)
>>                 plotMDS(logCPM)
>>
>>                 Best wishes
>>                 Gordon
>> 
>>
>>                 On Mon, 28 Apr 2014, shirley zhang wrote:
>>
>>                   Dear Dr. Smyth,
>> 
>>
>>                     Thanks for your reply. Here is the code sequence I
>>                     used to remove the
>>                     batch
>>                     effect and to make the PCA plot.
>>
>>                     First,  getting raw counts for each feature per
>>                     sample using htseq-count
>>                     (~64K features by using Ensemble.gtf)
>>                     Then, getting a count data.matrix by combing all
>>                     samples together (64K
>>                     rows, and 14 columns). 8 samples from batch1, and
>>                     6 samples from batch2.
>>
>>                       count = cbind(count.s1, count.s2, ...., count.s14)
>> 
>> 
>>
>>                     #remove the batch effect
>>                     library(edgeR)
>>                     batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>>
>>                     logCPM <- cpm(count,log=TRUE,prior.__count=5)
>>                     logCPM <- removeBatchEffect(logCPM, batch=batch)
>>
>>                     #PCA
>>                     B.res = prcomp(logCPM, scale = TRUE)
>>                     pc.s = summary(.Bres)$importance[2,1:__2]
>>                     pc1.var = round(pc.s[["PC1"]],2)
>>                     pc2.var = round(pc.s[["PC2"]],2)
>>
>>                     pdf(file = "all.count.logCPM.rmBatch.pca"__)
>>                     plot(B.res$rotation[,1:2], main = maintitle,  xlab
>>                     = paste("PC1
>>                     (variance:
>>                     ",pc1.var*100, "%)", sep =""), ylab = paste("PC2
>>                     (variance:
>>                     ",pc2.var*100,
>>                     "%)", sep =""))
>>                     dev.off()
>>
>>                       sessionInfo()
>> 
>>
>>                     R version 3.0.2 (2013-09-25)
>>                     Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>                     locale:
>>                     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>                     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>                     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>                     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>                     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>                     [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>>                     attached base packages:
>>                     [1] stats     graphics  grDevices utils
>>                     datasets  methods   base
>>
>>                     other attached packages:
>>                     [1] edgeR_3.4.0  limma_3.18.7
>> 
>>
>>                     Many thanks.
>>                     Shirley
>> 
>> 
>> 
>>
>>                     On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth
>>                     <smyth at wehi.edu.au <mailto:smyth at wehi.edu.au>>
>>                     wrote:
>> 
>>
>>                           Date: Sun, 27 Apr 2014 20:32:55 -0400
>>
>>                             From: shirley zhang <shirley0818 at gmail.com
>>                             <mailto:shirley0818 at gmail.com>>
>>                             To: Gordon K Smyth <smyth at wehi.edu.au
>>                             <mailto:smyth at wehi.edu.au>>
>>                             Cc: Bioconductor mailing list
>>                             <bioconductor at r-project.org
>>                             <mailto:bioconductor at r-project.org>>
>>                             Subject: Re: [BioC] Remove batch effect in
>>                             small RNASeq study (SVA or
>>                                      others?)
>>
>>                             Dear Dr. Smyth,
>>
>>                             Thank you very much for your quick reply.
>>                             I did as you suggested by
>>                             first
>>                             getting log CPM value, then call
>>                             removeBatchEffect(). I found the PCA
>>                             figure looks better than before, but there
>>                             is still a batch effect.
>> 
>>
>>                         I don't see how there could still be a batch
>>                         effect.
>>
>>                         Please give the code sequence you used to
>>                         remove the batch effect and to
>>                         make the PCA plot.
>>
>>                           I attached two PCA figures. One is based on
>>                         log10(raw count) which is
>>
>>                             before calling cpm() and
>>                             removeBatchEffect(). Another one is after.
>>
>>                             Could you look at them and give me more
>>                             suggestions. Will a quantile
>>                             normalization across samples be a good
>>                             idea since CPM() is still a
>>                             normalization only within each sample??
>> 
>>
>>                         You should normalize the data before using
>>                         removeBatchEffect(), and
>>                         quantile is one possibility.
>>
>>                         Gordon
>>
>>                           Thanks again for your help,
>>
>>                             Shirley
>> 
>>
>>                             On Sun, Apr 27, 2014 at 6:54 PM, Gordon K
>>                             Smyth <smyth at wehi.edu.au
>>                             <mailto:smyth at wehi.edu.au>>
>>                             wrote:
>>
>>                               Dear Shirley,
>> 
>>
>>                                 I would probably do it like this:
>>
>>                                    library(edgeR)
>>                                    logCPM <- cpm(y,log=TRUE,prior.count=5)
>>                                    logCPM <- removeBatchEffect(logCPM,
>>                                 batch=batch)
>>
>>                                 Best wishes
>>                                 Gordon
>>
>>                                   Date: Sat, 26 Apr 2014 10:51:23 -0400
>>
>>                                   From: shirley zhang
>>                                 <shirley0818 at gmail.com
>>                                 <mailto:shirley0818 at gmail.com>>
>>
>>                                     To: Bioconductor Mailing List
>>                                     <bioconductor at stat.math.ethz.__ch
>>                                     <mailto:bioconductor at stat.math.ethz.ch>>
>>                                     Subject: [BioC] Remove batch
>>                                     effect in small RNASeq study (SVA or
>>                                              others?)
>>
>>                                     I have a RNASeq paired-end data
>>                                     from two batches (8 samples from
>>                                     batch1,
>>                                     and 7 samples from batch2). After
>>                                     alignment using TopHat2, then I got
>>                                     count
>>                                     using HTseq-count, and FPKM value
>>                                     via Cufflinks. A big batch effect
>>                                     can
>>                                     be
>>                                     viewed in PCA using both log10(raw
>>                                     count) and log10(FPKM) value.
>>
>>                                     I can NOT use the block factor in
>>                                     edgeR to remove batch effect since
>>                                     I
>>                                     need
>>                                     to first obtain residuals after
>>                                     adjusting for batch effect, then test
>>                                     the
>>                                     residuals for hundreds of
>>                                     thousands of SNPs (eQTL analysis).
>>
>>                                     My question is how to remove batch
>>                                     effect without using edgeR:
>>
>>                                     1. is SVA ok for such a small
>>                                     sample size (N=15)?
>>                                     2. If SVA does not work, any other
>>                                     suggestions?
>>
>>                                     Many thanks,
>>                                     Shirley

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}