[BioC] GSEA and edgeR

Gordon K Smyth smyth at wehi.EDU.AU
Fri Sep 2 02:17:03 CEST 2011


Dear Iain,

Before I attempt a fuller reply, can you tell me what gene sets you are 
planning to use for this analysis?  Are you planning to use a large 
collection of sets, like the MSigDB or similar, or do you have a small 
number of gene sets of special relevance to your experiment?

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Thu, 1 Sep 2011, Iain Gallagher wrote:

Dear List / Gordon

I would like to carry out a GSEA analysis on some RNA-Seq data. I have 
analysed the data using edgeR and have a list of regulated genes. The data 
is paired - 6 biological samples, cells from each are infected or 
uninfected.

Looking around there is little advice on the use of GSEA in RNA-Seq data. 
Using edgeR (and having a relatively small sample size) I was hoping to 
make use of the romer algorithm which is implemented in limma. However I 
am having a conceptual problem setting up my data for this.

As the study uses paired sample data I have followed the guidance for this 
type of analysis in the marvellous edgeR manual. Searching for advice on 
how to conduct GSEA I came across this post 
(http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html) 
by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable 
strategy for edgeR and the rotational GSEA algorithms.

These algorithms require the specification of a contrast (in my case 
between infected and uninfected animals).

e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999)

My design matrix looks like this:

library(limma)

samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2)
infection <- c(rep(1,6), rep(2,6))

s <- as.factor(samples)
i <- as.factor(infection)

design <- model.matrix(~s+i)
colnames(design)[7] <- 'infection'

> design
    (Intercept) ss2 ss3 ss4 ss5 ss6 infection
1            1   0   0   0   0   0         0
2            1   1   0   0   0   0         0
3            1   0   1   0   0   0         0
4            1   0   0   1   0   0         0
5            1   0   0   0   1   0         0
6            1   0   0   0   0   1         0
7            1   0   0   0   0   0         1
8            1   1   0   0   0   0         1
9            1   0   1   0   0   0         1
10           1   0   0   1   0   0         1
11           1   0   0   0   1   0         1
12           1   0   0   0   0   1         1
attr(,"assign")
[1] 0 1 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$s
[1] "contr.treatment"

attr(,"contrasts")$i
[1] "contr.treatment"

and in edgeR I carry out the following fit of a GLM to get the genes 
regulated between infected and uninfected animals i.e. fit a GLM for 
infection and no infection then edgeR carries out likeliehood tests to 
find the genes where the two models differ (I think!).

#dispersion estimate
d <- estimateGLMCommonDisp(d, design)

#fit the NB GLM for each gene
fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion)

#carry out the likliehood ratio test
lrtFiltered <- glmLRT(d, fitFiltered, coef = 7)

#how many genes are DE?
summary(decideTestsDGE(lrtFiltered))#DE genes

So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer?

Thanks

Best

Iain

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
  [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
  [7] LC_PAPER=en_GB.utf8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C 
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] annotate_1.30.0        GO.db_2.5.0            goseq_1.4.0
  [4] geneLenDataBase_0.99.7 BiasedUrn_1.04         org.Bt.eg.db_2.5.0
  [7] RSQLite_0.9-4          DBI_0.2-5              AnnotationDbi_1.14.1 
[10] Biobase_2.12.2         edgeR_2.2.5            limma_3.8.3

loaded via a namespace (and not attached):
  [1] biomaRt_2.8.1         Biostrings_2.20.3     BSgenome_1.20.0
  [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8   grid_2.13.1
  [7] IRanges_1.10.6        lattice_0.19-33       Matrix_0.9996875-3 
[10] mgcv_1.7-6            nlme_3.1-102          RCurl_1.6-9 
[13] rtracklayer_1.12.4    tools_2.13.1          XML_3.4-2 
[16] xtable_1.5-6 
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list