[BioC] DEXSeq many exons marked as not testable
Stefan Dentro
sd11 at sanger.ac.uk
Wed Sep 19 13:02:41 CEST 2012
Hello,
I'm using DEXSeq to obtain differentially expressed exons between my 15
samples and 8 controls. But for some reason most exons are not testable
and are not assigned a p-value. Do you know what I'm doing wrong?
First I've created a conservative list of exons per gene in the human
genome through the canonical transcripts table in UCSC (194511 in
total). Next I've obtained read counts for each exon in each sample
which are read into one big data.frame. Then the following list of
commands are executed:
metadata = data.frame(
row.names=colnames(dat)[7:29],
condition=c(rep("control", 8), rep("data", 15)),
replicate=c(c(1:8), c(1:15)),
libType=rep("paired-end", 23))
## Add strand information that is not available in the read counts
data.frame
dat = cbind(dat, strand=rep(NA, nrow(dat)))
cds = newExonCountSet(countData = dat[,c(7:29)], design = metadata,
geneIDs = dat[,5], exonIDs = dat[,6],
exonIntervals=dat[,c(2,3,4, 30)])
cds = estimateSizeFactors(cds)
cds = estimateDispersions(cds, minCount=2, maxExon=90)
cds = fitDispersionFunction(cds)
cds = testForDEU(cds)
cds = estimatelog2FoldChanges(cds)
Now lets see how many exons are expressed significantly different
between data and controls. Surprisingly only 1850 exons are described
here, not the full 194511:
res1 = DEUresultTable(cds)
table(res1$padjust < 0.05)
FALSE TRUE
1424 426
So I've zoomed in to an example gene. Below it can be seen that indeed
all exons are marked as not testable. But looking at the individual read
counts per sample (see further below) it does not make sense that they
are not testable. Some exons indeed have a low read count and are
excluded rightfully, but others have enough reads to base dispersion on
I would think.
If I have understood correctly testable can be false in either of these
cases:
- Gene has only one exon, dispersion cannot be calculated.
- Readcounts for an exon are too low.
- A combination of the above.
But all do not apply here. Any ideas?
fData for the gene:
geneID exonID testable dispBeforeSharing dispFitted
dispersion pvalue padjust chr start end strand transcripts
136687 ENSG00000009780 1 FALSE NA 0.2026585
0.2026585 NA NA 1 28052568 28052672 <NA> <NA>
108973 ENSG00000009780 2 FALSE NA 0.9549670
0.9549670 NA NA 1 28053983 28054047 <NA> <NA>
1100 ENSG00000009780 3 FALSE NA 1.9560912
1.9560912 NA NA 1 28056743 28056844 <NA> <NA>
1 ENSG00000009780 4 FALSE NA 0.4722995
0.4722995 NA NA 1 28059114 28059168 <NA> <NA>
22096 ENSG00000009780 5 FALSE NA 0.1146813
0.1146813 NA NA 1 28060542 28060694 <NA> <NA>
13670 ENSG00000009780 6 FALSE NA 0.1127563
0.1127563 NA NA 1 28071165 28071322 <NA> <NA>
13666 ENSG00000009780 7 FALSE NA 0.1450013
0.1450013 NA NA 1 28075579 28075665 <NA> <NA>
22095 ENSG00000009780 8 FALSE NA 0.1160550
0.1160550 NA NA 1 28081706 28081841 <NA> <NA>
13665 ENSG00000009780 9 FALSE NA 0.1129432
0.1129432 NA NA 1 28086037 28086138 <NA> <NA>
138629 ENSG00000009780 10 FALSE NA 0.1112855
0.1112855 NA NA 1 28087006 28087092 <NA> <NA>
Read counts for the gene per sample:
exon_id chr start end gene_id exon_nr
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
1 ENSE00000327880 1 28059114 28059168 ENSG00000009780 4
11 8 4 7 3 1 8 3
1100 ENSE00000571221 1 28056743 28056844 ENSG00000009780 3
1 0 1 1 0 0 0 1
13665 ENSE00000761751 1 28086037 28086138 ENSG00000009780 9
91 145 86 47 169 36 231 136
13666 ENSE00000761753 1 28075579 28075665 ENSG00000009780 7
19 62 32 18 44 23 61 38
13670 ENSE00000761780 1 28071165 28071322 ENSG00000009780 6
77 139 73 66 78 30 168 88
22095 ENSE00000866714 1 28081706 28081841 ENSG00000009780 8
78 94 52 47 40 41 121 60
22096 ENSE00000866716 1 28060542 28060694 ENSG00000009780 5
36 61 50 63 67 10 129 69
108973 ENSE00001625313 1 28053983 28054047 ENSG00000009780 2
3 5 3 0 1 0 4 2
136687 ENSE00001909353 1 28052568 28052672 ENSG00000009780 1
2 12 5 0 19 0 34 10
138629 ENSE00001955917 1 28087006 28087092 ENSG00000009780 10
42 96 50 24 98 23 117 79
sample9 sample10 sample11 sample12 sample13 sample14 sample15
sample16 sample17 sample18 sample19 sample20 sample21 sample22
1 1 1 0 3 4 1 1
0 3 7 0 0 1 0
1100 2 0 0 0 0 1 1
0 0 3 1 0 0 0
13665 78 64 130 55 66 27 75
53 49 126 36 37 48 53
13666 7 20 51 6 43 8 15
19 11 26 7 15 11 5
13670 75 85 186 79 86 58 70
86 47 149 82 49 66 38
22095 64 64 149 51 91 37 67
54 58 148 54 41 64 43
22096 90 79 287 66 97 29 75
92 58 126 67 55 71 51
108973 1 0 0 1 1 0 1
1 0 4 0 0 0 0
136687 5 28 31 6 27 5 4
12 5 32 6 7 9 16
138629 119 111 233 100 129 59 141
103 70 219 92 60 74 67
sample23 strand
1 0 NA
1100 0 NA
13665 24 NA
13666 7 NA
13670 27 NA
22095 25 NA
22096 21 NA
108973 1 NA
136687 1 NA
138629 54 NA
Design of the experiment:
condition replicate libType
sample1 control 1 paired-end
sample2 control 2 paired-end
sample3 control 3 paired-end
sample4 control 4 paired-end
sample5 control 5 paired-end
sample6 control 6 paired-end
sample7 control 7 paired-end
sample8 control 8 paired-end
sample9 data 1 paired-end
sample10 data 2 paired-end
sample11 data 3 paired-end
sample12 data 4 paired-end
sample13 data 5 paired-end
sample14 data 6 paired-end
sample15 data 7 paired-end
sample16 data 8 paired-end
sample17 data 9 paired-end
sample18 data 10 paired-end
sample19 data 11 paired-end
sample20 data 12 paired-end
sample21 data 13 paired-end
sample22 data 14 paired-end
sample23 data 15 paired-end
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
[6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C
LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.7.1 DEXSeq_1.2.1 BiocInstaller_1.4.3
DESeq_1.8.3 locfit_1.5-7 Biobase_2.16.0
[7] BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] annotate_1.34.0 AnnotationDbi_1.18.0 biomaRt_2.12.0
DBI_0.2-5 genefilter_1.38.0 geneplotter_1.34.0
[7] grid_2.15.0 hwriter_1.3 IRanges_1.14.2
lattice_0.20-6 RColorBrewer_1.0-5 RCurl_1.91-1
[13] RSQLite_0.11.1 splines_2.15.0 statmod_1.4.15
stats4_2.15.0 stringr_0.6 survival_2.36-12
[19] tools_2.15.0 XML_3.9-4 xtable_1.7-0
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
More information about the Bioconductor
mailing list