[BioC] edgeR multifactorial design: confusing BCV plot
Natasha Sahgal
nsahgal at well.ox.ac.uk
Tue Sep 11 19:25:57 CEST 2012
Dear List,
This is a follow-up to my previous post:
https://stat.ethz.ch/pipermail/bioconductor/2012-July/047173.html
I finally have the count data and have started the analysis.
However, I do not understand the output of the BCV plot after estimating the dispersion.
I would therefore appreciate any help, advice or suggestions.
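For orientation when reading the BCV plot: the BCV (biological coefficient of variation) is the square root of the negative binomial dispersion, so the common values reported in the code section of this post can be checked by hand. A minimal base-R sketch, where the vector names `disp` and `bcv` are illustrative only:

```r
# Common dispersions reported for the three filters (copied from the output in this post)
disp <- c("cpm>1" = 0.0276, "cpm>2" = 0.02711, "cpm>3" = 0.02665)

# BCV is the square root of the dispersion
bcv <- sqrt(disp)
print(bcv)
```

Because the printed dispersions are themselves rounded, the results should agree with the reported BCVs (0.1661, 0.1646, 0.1632) only to about three decimal places.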
I would also appreciate comments on the filtering step, as it appears that I still have some genes with 0 read counts in individual samples (as seen under the normalisation section).
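One possible reading of the zero counts: a filter of the form `rowSums(cpm > 1) >= 4` only requires a gene to exceed the threshold in at least 4 of the 8 samples, so the remaining samples may still be zero. A minimal base-R sketch on made-up counts (not the data from this post), with CPM computed by hand rather than with edgeR's `cpm()`:

```r
# Toy count matrix: 3 genes x 8 samples (hypothetical numbers)
counts <- rbind(
  geneA = c(50, 60, 55, 70, 0, 0, 0, 0),  # expressed in 4 samples, zero elsewhere
  geneB = c(0, 0, 0, 1, 0, 0, 2, 0),      # barely detected
  geneC = rep(1e6, 8)                     # dominates the library size
)
lib.size <- colSums(counts)

# Counts per million: divide each column by its library size
cpm.manual <- t(t(counts) / lib.size) * 1e6

# Same rule as in the post: CPM > 1 in at least 4 samples
keep <- rowSums(cpm.manual > 1) >= 4
print(keep)

# A kept gene can still contain zeros in individual samples
print(range(counts[keep, ]))
```

Here `geneA` passes the filter although half of its counts are zero, so a minimum of 0 per sample after filtering is not by itself a sign that the filter failed.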
------------------------------------
Code:
dim(gene.counts.2) # 33607 8
## Sample Descriptions
group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2)))
## Creating dge list
y = DGEList(counts=gene.counts.2, group=group)
## Filtering
keep = rowSums(cpm(y)>1) >= 4
table(keep)
#FALSE TRUE
#17678 15929
keep2 = rowSums(cpm(y)>2) >= 4
table(keep2)
#FALSE TRUE
#19300 14307
keep3 = rowSums(cpm(y)>3) >= 4
table(keep3)
#FALSE TRUE
#20229 13378
y.filt = y[keep, ]
dim(y.filt$counts) # 15929 8
y.filt2 = y[keep2, ]
dim(y.filt2$counts) # 14307 8
y.filt3 = y[keep3, ]
dim(y.filt3$counts) # 13378 8
## Recalculate lib.size
y.filt$samples$lib.size = colSums(y.filt$counts)
y.filt2$samples$lib.size = colSums(y.filt2$counts)
y.filt3$samples$lib.size = colSums(y.filt3$counts)
## Normalisation
y.filt = calcNormFactors(y.filt)
range(y.filt$counts[,1]) # 0 159659
range(y.filt$counts[,2]) # 0 155390
range(y.filt$counts[,3]) # 0 122249
range(y.filt$counts[,4]) # 0 137046
range(y.filt$counts[,5]) # 0 206528
range(y.filt$counts[,6]) # 0 222176
range(y.filt$counts[,7]) # 0 192333
range(y.filt$counts[,8]) # 0 229413
y.filt2 = calcNormFactors(y.filt2)
range(y.filt2$counts[,1]) # 0 159659
range(y.filt2$counts[,2]) # 0 155390
range(y.filt2$counts[,3]) # 0 122249
range(y.filt2$counts[,4]) # 0 137046
range(y.filt2$counts[,5]) # 0 206528
range(y.filt2$counts[,6]) # 0 222176
range(y.filt2$counts[,7]) # 0 192333
range(y.filt2$counts[,8]) # 0 229413
y.filt3 = calcNormFactors(y.filt3)
range(y.filt3$counts[,1]) # 0 159659
range(y.filt3$counts[,2]) # 0 155390
range(y.filt3$counts[,3]) # 0 122249
range(y.filt3$counts[,4]) # 0 137046
range(y.filt3$counts[,5]) # 0 206528
range(y.filt3$counts[,6]) # 0 222176
range(y.filt3$counts[,7]) # 0 192333
range(y.filt3$counts[,8]) # 0 229413
## MDS plots
plotMDS(y.filt, main="cpm(y)>1")
plotMDS(y.filt2, main="cpm(y)>2")
plotMDS(y.filt3, main="cpm(y)>3")
## Design Matrix
design = model.matrix(~0+group)
colnames(design) = gsub("group","",colnames(design))
design
#  KO KO_stim WT WT_stim
#1  1       0  0       0
#2  1       0  0       0
#3  0       1  0       0
#4  0       1  0       0
#5  0       0  1       0
#6  0       0  1       0
#7  0       0  0       1
#8  0       0  0       1
## Estimating Dispersion
y.filt = estimateGLMCommonDisp(y.filt, design, verbose=TRUE)
#Disp = 0.0276 , BCV = 0.1661
y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=TRUE)
#Disp = 0.02711 , BCV = 0.1646
y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=TRUE)
#Disp = 0.02665 , BCV = 0.1632
y.filt = estimateGLMTrendedDisp(y.filt,design)
y.filt2 = estimateGLMTrendedDisp(y.filt2,design)
y.filt3 = estimateGLMTrendedDisp(y.filt3,design)
y.filt = estimateGLMTagwiseDisp(y.filt,design)
y.filt2 = estimateGLMTagwiseDisp(y.filt2,design)
y.filt3 = estimateGLMTagwiseDisp(y.filt3,design)
jpeg("BCVplots.jpg",height=500,width=900)
par(mfrow=c(1,3))
plotBCV(y.filt, main="cpm(y)>1")
plotBCV(y.filt2, main="cpm(y)>2")
plotBCV(y.filt3, main="cpm(y)>3")
dev.off()
### This section has NOT been run fully
## Fit Model
fit = glmFit(y.filt, design)
### The call below fails when tested, so I think I have not defined the contrast argument correctly
lrt = glmLRT(y.filt, fit, contrast=(KO_stim - KO) - (WT_stim - WT))
------------------------------------
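A likely cause of the failing contrast is that `KO_stim`, `KO`, `WT_stim` and `WT` are column names of the design matrix, not R objects, so the expression cannot be evaluated as written. A minimal base-R sketch of the same interaction contrast expressed as a numeric vector over the design columns, assuming the column order shown above (the name `contrast.vec` is illustrative):

```r
# Rebuild the design from the post: 4 groups, 2 replicates each
group <- factor(rep(c("KO", "KO_stim", "WT", "WT_stim"), each = 2))
design <- model.matrix(~0 + group)
colnames(design) <- gsub("group", "", colnames(design))

# (KO_stim - KO) - (WT_stim - WT) expands to -KO + KO_stim + WT - WT_stim
contrast.vec <- c(KO = -1, KO_stim = 1, WT = 1, WT_stim = -1)

# Guard against a mismatch between the contrast and the design columns
stopifnot(identical(names(contrast.vec), colnames(design)))
print(contrast.vec)

# Assumed usage, following the call form in the post (not run here):
# lrt <- glmLRT(y.filt, fit, contrast = contrast.vec)
```

limma's `makeContrasts()` with `levels=design` is an alternative that builds the same vector from the symbolic expression.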
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7
[4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0
[7] edgeR_2.6.10 limma_3.12.0
loaded via a namespace (and not attached):
[1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5
[4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0
[7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5
[10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14
[13] tools_2.15.0 xtable_1.7-0
------------------------------------
Many Thanks,
Natasha