[BioC] limma: is it best to always include paired structure (sibship) in design?

Tue Sep 18 17:47:13 CEST 2012

Hi Guido,

On 9/17/2012 5:50 PM, Hooiveld, Guido wrote:
> Hi,
> I am doing an analysis on a dataset, and have a question on whether or not to include the paired structure (sibship) in the model fitted by limma if this is not explicitly needed. We discussed this internally, but didn't get consensus. Hence, I would like to ask for opinions on this list... :)

Personally I would keep the pairing in the model regardless. The goal 
IMO is to capture as much of the observed variability with biologically 
meaningful/plausible coefficients. Accounting for intra-subject 
variability is fair game in this context, and as you see, it increases 
your power to detect differences.

Best,

Jim

>
> Let me explain:
> - Affymetrix data, RMA normalized.
> - samples from 20 subjects were analyzed on arrays, obtained from 10 'young' and 10 'old' individuals.
> - samples were taken at baseline, and after treatment with a drug (WY).
>
> Part of target file:
>> targets
>            Filename       Treatment     Sibship Category
> 1        G068_B07_05_05_CTRL.CEL Ctrl     5        old
> 2        G068_B09_06_06_CTRL.CEL Ctrl     6        old
> 3        G068_C05_07_07_CTRL.CEL Ctrl     7        old
> 4        G068_C09_09_09_CTRL_2.CEL        Ctrl     9        young
> 5        G068_D05_10_10_CTRL.CEL Ctrl     10      young
> 6        G068_D07_11_11_CTRL.CEL Ctrl     11      young
> 7        G068_F07_17_05_WY.CEL    WY     5        old
> 8        G068_F09_18_06_WY.CEL    WY     6        old
> 9        G068_G05_19_07_WY.CEL    WY     7        old
> 10      G068_G09_21_09_WY.CEL    WY     9        young
> 11      G068_H05_22_10_WY.CEL    WY     10      young
> 12      G068_H07_23_11_WY.CEL    WY     11      young
>
> It is obvious that for the treatment effect a paired analysis should be performed (using sibship info). This could be done for the whole group, or for young and old separately.
>
>> TC<- as.factor(paste(targets$Treatment, targets$Category, sep="."))
>> design<- model.matrix(~0+TC+Sibship)
>>
>> fit<- lmFit(x.norm, design)
> Coefficients not estimable: Sibship8 Sibship12
> Warning message:
> Partial NA coefficients for 19682 probe(s)
>> #test for effect of WY in old and young
>> cont.matrix<- makeContrasts(
> + WYold=(TCWY.old-TCCtrl.old),
> + WYyoung=(TCWY.young-TCCtrl.young),
> + levels=design)
>> fit2<- contrasts.fit(fit, cont.matrix)
>> fit2<- eBayes(fit2)
>
> If I would like to identify the probesets that are differentially expressed between old and young under either control or treatment conditions, I am essentially performing an unpaired t-test. Hence, info on sibship is thus not required.
>
>> cont.matrix<- makeContrasts(
> + ctrlold_ctrlyoung=(TCCtrl.old-TCCtrl.young),
> + WYold_WYyoung=(TCWY.old-TCWY.young),
> + levels=design)
> However, I noticed that the results of these 2nd set of comparisons (the old vs young) are strongly affected by including or not the sibship in the design. In other words, if I define this design:
>> design<- model.matrix(~0+TC+Sibship)
> I get a completely different set of top regulated probesets for the before mentioned contrasts when compared to this design (without sibship):
>> design<- model.matrix(~0+TC)
> I noticed that also the p-values are much smaller when including the sibship.
>
> As an example,
> WITH sibship:
>> topTable(fit2,coef=1)
>               ID     logFC  AveExpr         t      P.Value    adj.P.Val        B
> 15031    671_at -4.763308 5.591752 -51.58016 4.457435e-18 6.583366e-14 29.06475
> 9938    4317_at -5.545104 4.893897 -50.17288 6.689732e-18 6.583366e-14 28.82092
> 7454    2944_at -7.992487 5.861613 -43.89646 4.747228e-17 3.114498e-13 27.56297
> 14844 654433_at  5.136677 5.807219  40.89921 1.337134e-16 6.579367e-13 26.84567
> 1115    1088_at -4.762000 5.650997 -39.67501 2.085766e-16 8.210408e-13 26.52704
> 656    10321_at -6.458380 5.445312 -37.74764 4.319761e-16 1.417026e-12 25.99187
> 4162    1991_at -4.080680 6.171290 -36.76955 6.339135e-16 1.627014e-12 25.70344
> 3784    1669_at -6.130596 5.171480 -36.66315 6.613207e-16 1.627014e-12 25.67134
> 9557    4057_at -6.079963 5.789428 -32.16516 4.460724e-15 9.755107e-12 24.17063
> 2584    1359_at  3.730962 6.155636  30.49084 9.709508e-15 1.911025e-11 23.53108
> WITHOUT sibship:
>> topTable(fit2,coef=1)
>               ID     logFC  AveExpr         t      P.Value    adj.P.Val         B
> 14844 654433_at  5.320186 5.807219  9.802842 2.297163e-09 2.577632e-05 11.534050
> 18751   9173_at  3.052422 4.715249  9.439677 4.453505e-09 2.577632e-05 10.917557
> 6220    2624_at  2.443030 5.771388  9.281647 5.970493e-09 2.577632e-05 10.643512
> 13773  59340_at  4.154059 4.967316  9.221019 6.686698e-09 2.577632e-05 10.537431
> 2584    1359_at  3.572039 6.155636  9.095515 8.466416e-09 2.577632e-05 10.316169
> 16233  79608_at  1.774024 5.678042  9.073712 8.822506e-09 2.577632e-05 10.277501
> 2040    1232_at  2.911866 4.211083  9.053443 9.167473e-09 2.577632e-05 10.241491
> 17108  83478_at  1.522142 6.045796  8.750724 1.635842e-08 4.024580e-05  9.696605
> 1855    1178_at  3.134552 8.481612  8.619180 2.111666e-08 4.304048e-05  9.455663
> 6733    2766_at -2.157836 5.223548 -8.568223 2.332607e-08 4.304048e-05  9.361647
> Thus, although it is not required, would it be recommended to include for the 2nd set of contrasts the paired structure of the data in the design?
> I would argue to do so (since intuitively I feel it would be good to always include as much info as possible on the correlation structure of the data), but as said not everyone in the project team agrees with me.
>
> So any opinions/comments are much appreciated.
>
> Regards,
> Guido
>
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hugene11stv1hsentrezg.db_15.1.0 org.Hs.eg.db_2.7.1              RSQLite_0.11.1                  DBI_0.2-5                       hugene11stv1hsentrezgcdf_15.1.0 AnnotationDbi_1.18.3
>   [7] qvalue_1.30.0                   multtest_2.12.0                 affy_1.34.0                     limma_3.12.3                    pamr_1.54                       survival_2.36-14
> [13] cluster_1.14.2                  bladderbatch_1.0.3              Biobase_2.16.0                  BiocGenerics_0.2.0              sva_3.2.1                       mgcv_1.7-20
> [19] corpcor_1.6.4                   BiocInstaller_1.4.7
>
> loaded via a namespace (and not attached):
> [1] affyio_1.24.0         grid_2.15.1           IRanges_1.14.4        lattice_0.20-10       MASS_7.3-21           Matrix_1.0-9          nlme_3.1-104          preprocessCore_1.18.0
> [9] stats4_2.15.1         tcltk_2.15.1          tools_2.15.1          zlibbioc_1.2.0
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism&  Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099