[BioC] edgeR design matrix, one group vs average of other groups
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Mar 15 23:20:58 CET 2014
Dear Georg,
It makes no difference how many samples you have in each group (and why
should it?).
The choice of test depends on the scientific questions you wish to answer,
not on technical aspects of your dataset. The only reason that you might
combine A and B would be if you specifically wanted to find genes that are
the *same* in A and B but different in C. From what you have said, that is
not what you want.
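
To make the distinction concrete, here is a minimal sketch using an ordinary
linear model in base R rather than edgeR itself. The expression values are
made up for illustration, and the names est_average and est_pooled are
hypothetical. The averaged contrast estimates C minus the unweighted mean of
A and B, whatever the group sizes. Pooling A and B instead estimates C minus
the mean of all pooled samples, so the larger group dominates. edgeR fits a
negative binomial GLM, so the numbers would differ, but the two comparisons
are set up in a broadly analogous way.

```r
# Toy log-expression values for one gene (made-up numbers, illustration only).
# Six samples in A, two in B, one in C, as in the unbalanced example.
group <- factor(c(rep("A", 6), rep("B", 2), "C"))
y <- c(rep(2, 6), rep(6, 2), 5)

# Three-group design: one coefficient per group mean.
design3 <- model.matrix(~0 + group)
colnames(design3) <- levels(group)
coef3 <- lm.fit(design3, y)$coefficients

# C versus the unweighted average of A and B.
est_average <- sum(c(-0.5, -0.5, 1) * coef3)   # 5 - (2 + 6)/2 = 1

# Pooled design: A and B treated as a single group.
pooled <- factor(ifelse(group == "C", "C", "A.B"))
design2 <- model.matrix(~0 + pooled)
colnames(design2) <- levels(pooled)
coef2 <- lm.fit(design2, y)$coefficients

# C versus the mean of all eight pooled samples (dominated by A).
est_pooled <- sum(c(-1, 1) * coef2)            # 5 - 24/8 = 2

print(c(average = est_average, pooled = est_pooled))
```

In this toy case the pooled comparison mostly measures C against A, because
A contributes six of the eight pooled samples.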
Best wishes
Gordon
> Date: Fri, 14 Mar 2014 19:26:08 +0000
> From: Georg Otto <georg.otto at imm.ox.ac.uk>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] edgeR design matrix, one group vs average of other
> groups
>
>
> Thanks a lot, Yunshun and Ryan for your informative answers. I
> understand that for my purposes it is preferable to use a design matrix
> like this
>
>
>> design
>          A B C
> sample.1 1 0 0
> sample.2 1 0 0
> sample.3 0 1 0
> sample.4 0 1 0
> sample.5 0 0 1
>
> and average for the contrast like this
>
>> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
>
>
> But what would happen if there is a strong imbalance between samples A
> and B, eg:
>
>> design
>          A B C
> sample.1 1 0 0
> sample.2 1 0 0
> sample.3 1 0 0
> sample.4 1 0 0
> sample.5 1 0 0
> sample.6 1 0 0
> sample.7 0 1 0
> sample.8 0 1 0
> sample.9 0 0 1
>
>
> Should I still use the above approach or is it more advisable to put A
> and B in one group and test AB vs C?
>
>> design
>          A.B C
> sample.1   1 0
> sample.2   1 0
> sample.3   1 0
> sample.4   1 0
> sample.5   1 0
> sample.6   1 0
> sample.7   1 0
> sample.8   1 0
> sample.9   0 1
>
>> lrt <- glmLRT(fit, contrast=c(-1,1))
>
>
> Thanks a lot and best wishes,
>
> Georg
>
>
> Georg Otto <georg.otto at imm.ox.ac.uk> writes:
>
>
>
>> Dear Bioconductors,
>>
>> I am working on RNA-seq data with multiple experimental factors and I am
>> trying to reproduce the edgeR manual, chapter 3.2.3, GLM approach.
>>
>>
>>> design <- model.matrix(~0+group, data=y$samples)
>>> colnames(design) <- levels(y$samples$group)
>>> design
>>          A B C
>> sample.1 1 0 0
>> sample.2 1 0 0
>> sample.3 0 1 0
>> sample.4 0 1 0
>> sample.5 0 0 1
>>
>>> fit <- glmFit(y, design)
>>
>>
>> I want to know which genes are differentially expressed in C compared to
>> the other groups, so I chose to compare C to the average of A and B
>>
>>> lrt <- glmLRT(fit, contrast=c(-0.5,-0.5,1))
>>
>>
>> Alternatively I could put A and B in a single group
>>
>>> design
>>          A.B C
>> sample.1   1 0
>> sample.2   1 0
>> sample.3   1 0
>> sample.4   1 0
>> sample.5   0 1
>>
>>> fit <- glmFit(y, design)
>>
>> and compare C to A.B
>>
>>> lrt <- glmLRT(fit, contrast=c(-1,1))
>>
>>
>> When I try this with my own data, the first approach gives me many more
>> differentially expressed genes than the second one, but the second gene
>> set is a subset of the first one. I would be very grateful if somebody
>> could explain to me what is the difference between the approaches, and
>> which one is the more appropriate for my purpose (find genes specific
>> for condition C)
>>
>> Best wishes,
>>
>> Georg
>>
>>> sessionInfo()
>>
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] limma_3.18.13
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_3.0.1 tools_3.0.1