[BioC] edgeR design matrix, one group vs average of other groups
Georg Otto
georg.otto at imm.ox.ac.uk
Fri Mar 14 20:26:08 CET 2014
Thanks a lot, Yunshun and Ryan, for your informative answers. I
understand that for my purposes it is preferable to use a design matrix
like this:
> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 0 1 0
sample.4 0 1 0
sample.5 0 0 1
and to average A and B in the contrast like this:
> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
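To make explicit what this contrast vector tests, here is a minimal sketch in base R. The coefficient values in `beta` are hypothetical and chosen only for illustration; they are not taken from any real fit. The point is that the contrast is a linear combination of the per-group coefficients, equal to C minus the unweighted mean of A and B.

```r
# Hypothetical per-group log-expression coefficients for one gene
# (illustrative values only, not from the data discussed here).
beta <- c(A = 2.0, B = 4.0, C = 5.0)

# Contrast from the post: C minus the unweighted average of A and B.
contrast <- c(A = -0.5, B = -0.5, C = 1)

# The tested quantity is the linear combination sum(contrast * beta).
effect <- sum(contrast * beta)

# The same quantity written out by hand.
by_hand <- beta[["C"]] - mean(beta[c("A", "B")])

print(effect)
print(by_hand)
```

Because the contrast weights sum to zero, the tested effect is zero whenever all three groups share the same coefficient.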
But what would happen if there is a strong imbalance between the number
of samples in groups A and B, e.g.:
> design
         A B C
sample.1 1 0 0
sample.2 1 0 0
sample.3 1 0 0
sample.4 1 0 0
sample.5 1 0 0
sample.6 1 0 0
sample.7 0 1 0
sample.8 0 1 0
sample.9 0 0 1
Should I still use the above approach or is it more advisable to put A
and B in one group and test AB vs C?
> design
         A.B C
sample.1   1 0
sample.2   1 0
sample.3   1 0
sample.4   1 0
sample.5   1 0
sample.6   1 0
sample.7   1 0
sample.8   1 0
sample.9   0 1
> lrt <- glmLRT(fit, contrast=c(-1,1))
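As a rough illustration of how the two options differ under imbalance, the sketch below uses an ordinary least-squares fit in base R on hypothetical log-scale values for a single gene. This is an analogy only: edgeR fits a negative binomial GLM with a log link, so the actual numbers would differ, and the values in `y` are made up. With six A samples and two B samples, the pooled A.B coefficient is pulled toward A, whereas the three-group contrast weights A and B equally regardless of group size.

```r
# Hypothetical log-scale expression for one gene, laid out as in the
# unbalanced design above: 6 x A, 2 x B, 1 x C (values are made up).
group <- factor(rep(c("A", "B", "C"), times = c(6, 2, 1)))
y <- c(rep(2, 6), rep(4, 2), 5)

# Option 1: three-group design, C vs the unweighted average of A and B.
design3 <- model.matrix(~ 0 + group)
colnames(design3) <- levels(group)
beta3 <- lm.fit(design3, y)$coefficients
effect_contrast <- sum(c(-0.5, -0.5, 1) * beta3)

# Option 2: pool A and B into a single group, then test C vs A.B.
pooled <- factor(ifelse(group == "C", "C", "A.B"))
design2 <- model.matrix(~ 0 + pooled)
colnames(design2) <- levels(pooled)
beta2 <- lm.fit(design2, y)$coefficients
effect_pooled <- sum(c(-1, 1) * beta2)

print(beta3)            # per-group means
print(beta2)            # pooled A.B mean is weighted by sample count
print(effect_contrast)  # C - (A + B) / 2
print(effect_pooled)    # C - (6 * A + 2 * B) / 8
```

In this toy setting the pooled comparison also treats any real difference between A and B as within-group variability, which would tend to inflate the estimated dispersion; whether that matters in practice depends on how different A and B actually are.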
Thanks a lot and best wishes,
Georg
Georg Otto <georg.otto at imm.ox.ac.uk> writes:
> Dear Bioconductors,
>
> I am working on RNA-seq data with multiple experimental factors and I am
> trying to reproduce the edgeR manual, chapter 3.2.3, GLM approach.
>
>
>> design <- model.matrix(~0+group, data=y$samples)
>> colnames(design) <- levels(y$samples$group)
>> design
>          A B C
> sample.1 1 0 0
> sample.2 1 0 0
> sample.3 0 1 0
> sample.4 0 1 0
> sample.5 0 0 1
>
>> fit <- glmFit(y, design)
>
>
> I want to know which genes are differentially expressed in C compared to
> the other groups, so I chose to compare C to the average of A and B
>
>> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
>
>
> Alternatively I could put A and B in a single group
>
>> design
>          A.B C
> sample.1   1 0
> sample.2   1 0
> sample.3   1 0
> sample.4   1 0
> sample.5   0 1
>
>> fit <- glmFit(y, design)
>
> and compare C to A.B
>
>> lrt <- glmLRT(fit, contrast=c(-1,1))
>
>
> When I try this with my own data, the first approach gives me many more
> differentially expressed genes than the second one, but the second gene
> set is a subset of the first one. I would be very grateful if somebody
> could explain to me what is the difference between the approaches, and
> which one is the more appropriate for my purpose (find genes specific
> for condition C)
>
> Best wishes,
>
> Georg
>
>> sessionInfo()
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] limma_3.18.13
>
> loaded via a namespace (and not attached):
> [1] compiler_3.0.1 tools_3.0.1
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor