[BioC] limma
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Apr 9 03:35:54 CEST 2011
Hi Wolfgang,
> Date: Thu, 07 Apr 2011 12:07:05 +0200
> From: Wolfgang Huber <whuber at embl.de>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] limma
>
> Hi Gordon
>
>> .... "limma ensures that all probes are
>> assigned at least a minimum non-zero expression level on all arrays, in
>> order to minimize the variability of log-intensities for lowly expressed
>> probes. Probes that are expressed in one condition but not the other will be
>> assigned a large fold change for which the denominator is the minimum
>> expression level. This approach has the advantage that genes can be
>> ranked by fold change in a meaningful way, because genes with larger
>> expression changes will always be assigned a larger fold
>> change."
This comment was in the context of genes expressed in one condition and
not the other (and was part of a longer post). In this context the
estimated fold change is essentially monotonic in the higher expression
level, provided the zero value is offset away from zero, so larger
expression changes do translate into larger fold changes. In other
contexts, it is a question of importance ranking, which I guess is the
issue that you're raising below.
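To make this concrete, here is a small numeric sketch in R (the intensity
values are hypothetical):

  ## With one condition at background (taken as 0 here), an offset k keeps
  ## the fold change (x + k) / (0 + k) finite and monotone increasing in x,
  ## so larger expressed intensities always get larger fold changes.
  k <- 16
  x <- c(100, 1000, 10000)   # expressed-condition intensities
  log2((x + k) / (0 + k))    # 2.86, 5.99, 9.29: monotone in x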
> I am not sure I follow:
>
> (i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)
>
> but
>
> (ii) 20 / 10 > 15000 / 10000
>
> You assume that measurements of 20 and 10 are less reliable (or perhaps
> biologically less important?) than measurements of 15000 and 10000, thus
> that ranking (i) should be used
Generally I rank probes by a combination of statistical significance and
fold change, not by fold change alone. However, the discussion is in the
context of Illumina expression data, and Illumina intensities of 10 and 20
are almost certain to be from non-expressed probes, hence contain no
biological signal. So, yes, I would generally view measurements of 15000
and 10000 as both statistically more precise and biologically more
important than 20 and 10, and I would therefore want to rank as (i) rather
than (ii). I'm pretty sure that you would too.
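For concreteness, the two rankings can be computed directly in R:

  offset <- 16
  c(low = (20 + offset) / (10 + offset),
    high = (15000 + offset) / (10000 + offset))
  ## low 1.38, high 1.50: with the offset the high-intensity pair wins
  c(low = 20 / 10, high = 15000 / 10000)
  ## low 2.00, high 1.50: raw ratios put the noisy low-intensity pair first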
> - but that depends on an error model (which you encode in the
> pseudocount parameter '16')
I put more faith in experimental evidence than I do in statistical error
models. The fact that offsetting the intensities away from zero reduces
the FDR is an observation from considerable testing with calibration data
sets. The evidence doesn't rely on an error model. Much of the evidence
is laid out in the paper that I cited in my earlier email:
Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus bias
trade-off for Illumina Whole Genome Expression BeadChips. Nucleic Acids
Research 38, e204.
> and a subjective trade-off between precision and effect size.
The fact that the value is chosen from experience with data, rather than
as a parameter estimated from a mathematical model, doesn't make it
subjective. As I've said, I take mathematical models with a grain of
salt.
It's easy to verify experimentally that well-known preprocessing
algorithms, like RMA for Affy data or vst for Illumina data (you're an
author!), also have the effect of offsetting intensities away from zero
before logging them. I think it is a useful insight to observe that this
offsetting is a good part of why those algorithms have good statistical
properties. vst has an effective offset of around 200 (Shi et al 2010,
Tables 2 and 3). As far as I know, the offset was not designed into either
of the above algorithms. I suspect it was rather a fortuitous but
unexpected outcome. The offset that vst seems to have isn't a natural
outcome of the variance stabilization model, because it generally turns
out to be much larger than the offset that would best stabilize the
variance. Anyway, we find that by using more modest offsets in the range
16-50 for Illumina data, we can achieve FDR as good as vst but with less
bias, that is, much less contraction of the fold changes. Again, this is a
conclusion from testing rather than from modelling.
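As a rough illustration of the trade-off, here is a toy simulation in R
(all parameters are invented for illustration, not taken from the paper):

  set.seed(1)
  bg <- rnorm(10000, mean = 0, sd = 30)  # additive background noise
  signal <- 50                           # a lowly expressed probe
  for (k in c(0.5, 16, 50, 200)) {
    noise <- var(log2(pmax(signal + bg + k, 0.25)))  # spread on log scale
    bias <- log2((100 + k) / (50 + k))               # a true 2-fold change
    cat(sprintf("offset %5.1f: var(log2) = %.2f, log2FC = %.2f\n",
                k, noise, bias))
  }

Larger offsets shrink the variance of the low-intensity log-values (less
noise) but also contract the observed log-fold-change towards zero (more
bias).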
I prefer to make the offset explicit, clearly visible to users, rather
than leaving it implicit or unexpected. This approach (neqc etc) isn't
the only good way to address noise, bias and variance stabilization
issues, but it's the one that seems to work best for me at the moment.
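For example, with limma (the file names are just placeholders for your own
BeadStudio/GenomeStudio output):

  library(limma)
  x <- read.ilmn(files = "probe profile.txt",
                 ctrlfiles = "control probe profile.txt")
  y <- neqc(x, offset = 16)  # normexp background correction using the
                             # negative controls, quantile normalization,
                             # then log2 with the offset added

The offset argument makes the value explicit, and can be set anywhere in
the 16-50 range discussed above.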
Cheers
Gordon
> I agree with you that the approach is useful, and also that it is good
> to provide a very simple recipe for people that either cannot deal with
> or do not care about the quantitative details. Still, this post is for
> the people that do :)
>
> Cheers
> Wolfgang