[BioC] edgeR: confusing BCV plot

Wed Sep 12 15:43:25 CEST 2012

Dear Jim,

Regarding the BCV plots, what I did not understand was the strange profile (at least strange to me), and the low coefficients of BV.
Based on some figures from the user guide, it appeared to be very different - an increase towards the higher logCPM. 
1) I'm not sure how to interpret these and if it is a good thing or not? (perhaps I have misunderstood the concept of the BCV)
2) How does this affect the differential expression, if at all?

Re the filtering, for some reason I was under the impression increasing the counts per million would reduce (if not remove) zero counts in all samples. I agree with what you say about half the samples being unconstrained. 

I had 3 filters here, just to see what the difference would be. I still need to figure out the best or optimal one to use.

Many Thanks and Best Wishes,
Natasha

-----Original Message-----
From: James W. MacDonald [mailto:jmacdon at uw.edu] 
Sent: 12 September 2012 14:29
To: Natasha Sahgal
Cc: 'bioconductor at r-project.org'
Subject: Re: [BioC] edgeR: confusing BCV plot

Hi Natasha,

On 9/12/2012 5:05 AM, Natasha Sahgal wrote:
> Dear List,
>
> Sorry for any cross-posting.
>
> I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design.
> The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group.
> My aim: to detect DE genes based on the effect of stimulus on ko cells.
>
> However, I do not understand the output from the BCV plot after estimating dispersion.
> Thus would appreciate any help/advice/suggestions.

What about the output do you not understand?

>
> Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section).

You didn't filter in such a way to remove all genes with zero counts. In fact you didn't say anything about zero counts - instead what you did was to require at least four samples to have more than 1 or 2 or 3 counts per million. The other four samples were unconstrained, and could easily have had zero counts.

Note that this is a reasonable thing to do. You are looking for genes where one sample had more transcripts than the other sample. This includes the situation where one of the samples appears not to transcribe the gene at all.

Best,

Jim

>
> ------------------------------------
> Code:
> dim(gene.counts.2) # 33607     8
> ## Sample Descriptions
> group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2)))
>
> ## Creating dge list
> y = DGEList(counts=gene.counts.2, group=group)
>
> ## Filtering
> keep = rowSums(cpm(y)>1)>= 4
> table(keep)
> #FALSE  TRUE
> #17678 15929
> keep2 = rowSums(cpm(y)>2)>= 4
> table(keep2)
> #FALSE  TRUE
> #19300 14307
> keep3 = rowSums(cpm(y)>3)>= 4
> table(keep3)
> #FALSE  TRUE
> #20229 13378
>
> y.filt = y[keep, ]
> dim(y.filt$counts)  # 15929     8
>
> y.filt2 = y[keep2, ]
> dim(y.filt2$counts) # 14307     8
>
> y.filt3 = y[keep3, ]
> dim(y.filt3$counts) # 13378     8
>
> ## Recalculate lib.size
> y.filt$samples$lib.size = colSums(y.filt$counts) 
> y.filt2$samples$lib.size = colSums(y.filt2$counts) 
> y.filt3$samples$lib.size = colSums(y.filt3$counts)
>
> ## Normalisation
> y.filt = calcNormFactors(y.filt)
>
> range(y.filt$counts[,1]) #  0 159659
> range(y.filt$counts[,2]) #  0 155390
> range(y.filt$counts[,3]) #  0 122249
> range(y.filt$counts[,4]) #  0 137046
> range(y.filt$counts[,5]) #  0 206528
> range(y.filt$counts[,6]) #  0 222176
> range(y.filt$counts[,7]) #  0 192333
> range(y.filt$counts[,8]) #  0 229413
>
> y.filt2 = calcNormFactors(y.filt2)
>
> range(y.filt2$counts[,1]) #  0 159659
> range(y.filt2$counts[,2]) #  0 155390
> range(y.filt2$counts[,3]) #  0 122249
> range(y.filt2$counts[,4]) #  0 137046
> range(y.filt2$counts[,5]) #  0 206528
> range(y.filt2$counts[,6]) #  0 222176
> range(y.filt2$counts[,7]) #  0 192333
> range(y.filt2$counts[,8]) #  0 229413
>
> y.filt3 = calcNormFactors(y.filt3)
>
> range(y.filt3$counts[,1]) #  0 159659
> range(y.filt3$counts[,2]) #  0 155390
> range(y.filt3$counts[,3]) #  0 122249
> range(y.filt3$counts[,4]) #  0 137046
> range(y.filt3$counts[,5]) #  0 206528
> range(y.filt3$counts[,6]) #  0 222176
> range(y.filt3$counts[,7]) #  0 192333
> range(y.filt3$counts[,8]) #  0 229413
>
> ## MDS plots
> plotMDS(y.filt, main="cpm(y)>1")
> plotMDS(y.filt2, main="cpm(y)>2")
> plotMDS(y.filt3, main="cpm(y)>3")
>
> ## Design Matrix
> design = model.matrix(~0+group)
> colnames(design) = gsub("group","",colnames(design)) design #  KO KO_stim WT WT_stim
> #1  1       0  0       0
> #2  1       0  0       0
> #3  0       1  0       0
> #4  0       1  0       0
> #5  0       0  1       0
> #6  0       0  1       0
> #7  0       0  0       1
> #8  0       0  0       1
>
> ## Estimating Dispersion
> y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 
> 0.0276 , BCV = 0.1661
> y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 
> 0.02711 , BCV = 0.1646
> y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 
> 0.02665 , BCV = 0.1632
>
>
> y.filt = estimateGLMTrendedDisp(y.filt,design)
> y.filt2 = estimateGLMTrendedDisp(y.filt2,design)
> y.filt3 = estimateGLMTrendedDisp(y.filt3,design)
>
> y.filt = estimateGLMTagwiseDisp(y.filt,design)
> y.filt2 = estimateGLMTagwiseDisp(y.filt2,design)
> y.filt3 = estimateGLMTagwiseDisp(y.filt3,design)
>
> jpeg("BCVplots.jpg",height=500,width=900)
> par(mfrow=c(1,3))
> plotBCV(y.filt, main="cpm(y)>1")
> plotBCV(y.filt2, main="cpm(y)>2")
> plotBCV(y.filt3, main="cpm(y)>3")
> dev.off()
> (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg)
>
> ### NOT RUN this section
> ## Fit Model
> fit = glmFit(y.filt, design)
> cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) 
> lrt = glmLRT(y.filt, fit, contrast=as.vector(cont))
> ------------------------------------
>
> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] scatterplot3d_0.3-33 DESeq_1.8.3          locfit_1.5-7
> [4] Biobase_2.16.0       BiocGenerics_0.2.0   WriteXLS_2.1.0
> [7] edgeR_2.6.10         limma_3.12.0
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.34.0      AnnotationDbi_1.18.0 DBI_0.2-5
>   [4] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0
>   [7] IRanges_1.14.2       lattice_0.20-6       RColorBrewer_1.0-5
> [10] RSQLite_0.11.1       stats4_2.15.0        survival_2.36-14
> [13] tools_2.15.0         xtable_1.7-0
> ------------------------------------
>
> Many Thanks,
> Natasha
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099