[BioC] edgeR: confusing BCV plot

Wed Sep 12 15:29:08 CEST 2012

Hi Natasha,

On 9/12/2012 5:05 AM, Natasha Sahgal wrote:
> Dear List,
>
> Sorry for any cross-posting.
>
> I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design.
> The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group.
> My aim: to detect DE genes based on the effect of stimulus on ko cells.
>
> However, I do not understand the output from the BCV plot after estimating dispersion.
> Thus would appreciate any help/advice/suggestions.

What about the output do you not understand?

>
> Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section).

You didn't filter in such a way to remove all genes with zero counts. In 
fact you didn't say anything about zero counts - instead what you did 
was to require at least four samples to have more than 1 or 2 or 3 
counts per million. The other four samples were unconstrained, and could 
easily have had zero counts.

Note that this is a reasonable thing to do. You are looking for genes 
where one sample had more transcripts than the other sample. This 
includes the situation where one of the samples appears not to 
transcribe the gene at all.

Best,

Jim

>
> ------------------------------------
> Code:
> dim(gene.counts.2) # 33607     8
> ## Sample Descriptions
> group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2)))
>
> ## Creating dge list
> y = DGEList(counts=gene.counts.2, group=group)
>
> ## Filtering
> keep = rowSums(cpm(y)>1)>= 4
> table(keep)
> #FALSE  TRUE
> #17678 15929
> keep2 = rowSums(cpm(y)>2)>= 4
> table(keep2)
> #FALSE  TRUE
> #19300 14307
> keep3 = rowSums(cpm(y)>3)>= 4
> table(keep3)
> #FALSE  TRUE
> #20229 13378
>
> y.filt = y[keep, ]
> dim(y.filt$counts)  # 15929     8
>
> y.filt2 = y[keep2, ]
> dim(y.filt2$counts) # 14307     8
>
> y.filt3 = y[keep3, ]
> dim(y.filt3$counts) # 13378     8
>
> ## Recalculate lib.size
> y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts)
>
> ## Normalisation
> y.filt = calcNormFactors(y.filt)
>
> range(y.filt$counts[,1]) #  0 159659
> range(y.filt$counts[,2]) #  0 155390
> range(y.filt$counts[,3]) #  0 122249
> range(y.filt$counts[,4]) #  0 137046
> range(y.filt$counts[,5]) #  0 206528
> range(y.filt$counts[,6]) #  0 222176
> range(y.filt$counts[,7]) #  0 192333
> range(y.filt$counts[,8]) #  0 229413
>
> y.filt2 = calcNormFactors(y.filt2)
>
> range(y.filt2$counts[,1]) #  0 159659
> range(y.filt2$counts[,2]) #  0 155390
> range(y.filt2$counts[,3]) #  0 122249
> range(y.filt2$counts[,4]) #  0 137046
> range(y.filt2$counts[,5]) #  0 206528
> range(y.filt2$counts[,6]) #  0 222176
> range(y.filt2$counts[,7]) #  0 192333
> range(y.filt2$counts[,8]) #  0 229413
>
> y.filt3 = calcNormFactors(y.filt3)
>
> range(y.filt3$counts[,1]) #  0 159659
> range(y.filt3$counts[,2]) #  0 155390
> range(y.filt3$counts[,3]) #  0 122249
> range(y.filt3$counts[,4]) #  0 137046
> range(y.filt3$counts[,5]) #  0 206528
> range(y.filt3$counts[,6]) #  0 222176
> range(y.filt3$counts[,7]) #  0 192333
> range(y.filt3$counts[,8]) #  0 229413
>
> ## MDS plots
> plotMDS(y.filt, main="cpm(y)>1")
> plotMDS(y.filt2, main="cpm(y)>2")
> plotMDS(y.filt3, main="cpm(y)>3")
>
> ## Design Matrix
> design = model.matrix(~0+group)
> colnames(design) = gsub("group","",colnames(design)) design #  KO KO_stim WT WT_stim
> #1  1       0  0       0
> #2  1       0  0       0
> #3  0       1  0       0
> #4  0       1  0       0
> #5  0       0  1       0
> #6  0       0  1       0
> #7  0       0  0       1
> #8  0       0  0       1
>
> ## Estimating Dispersion
> y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661
> y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646
> y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632
>
>
> y.filt = estimateGLMTrendedDisp(y.filt,design)
> y.filt2 = estimateGLMTrendedDisp(y.filt2,design)
> y.filt3 = estimateGLMTrendedDisp(y.filt3,design)
>
> y.filt = estimateGLMTagwiseDisp(y.filt,design)
> y.filt2 = estimateGLMTagwiseDisp(y.filt2,design)
> y.filt3 = estimateGLMTagwiseDisp(y.filt3,design)
>
> jpeg("BCVplots.jpg",height=500,width=900)
> par(mfrow=c(1,3))
> plotBCV(y.filt, main="cpm(y)>1")
> plotBCV(y.filt2, main="cpm(y)>2")
> plotBCV(y.filt3, main="cpm(y)>3")
> dev.off()
> (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg)
>
> ### NOT RUN this section
> ## Fit Model
> fit = glmFit(y.filt, design)
> cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design)
> lrt = glmLRT(y.filt, fit, contrast=as.vector(cont))
> ------------------------------------
>
> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] scatterplot3d_0.3-33 DESeq_1.8.3          locfit_1.5-7
> [4] Biobase_2.16.0       BiocGenerics_0.2.0   WriteXLS_2.1.0
> [7] edgeR_2.6.10         limma_3.12.0
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.34.0      AnnotationDbi_1.18.0 DBI_0.2-5
>   [4] genefilter_1.38.0    geneplotter_1.34.0   grid_2.15.0
>   [7] IRanges_1.14.2       lattice_0.20-6       RColorBrewer_1.0-5
> [10] RSQLite_0.11.1       stats4_2.15.0        survival_2.36-14
> [13] tools_2.15.0         xtable_1.7-0
> ------------------------------------
>
> Many Thanks,
> Natasha
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099