[BioC] limma

Wolfgang Huber whuber at embl.de
Sun Apr 10 16:33:28 CEST 2011


Hi Gordon

I think we agree, my only reason for making this post is to point out 
that while the choice of '16' might be a good comprise overall, better 
choices can exist depending on the dataset. E.g. if there are many 
replicates, then these can compensate the imprecision from using a 
smaller offset and improved biases may be had.

Variance stabilising transformations are of course very related to the 
offseting that you propose. The "glog" in vsn is

   glog(x) = log( x+ sqrt(x^2 + c) )

which is in practice similar to the shifted log, i.e. log (x+a). (It 
even avoids problems that the shifted log has when x approaches -a. See 
also e.g. PMID 12761059.)

Such transformations provide a good way of looking at microarray data, 
and indeed at high-thoughput sequencing data, for many though not all uses.

As you say in your NAR article, there is a continuum of choices along 
the precision - bias tradeoff axis, which can be parameterized by 'a' or 
'c'.

Btw, a subtlety of vst that many users seem not to be aware of is that 
it uses the between beads variance (for the same gene, in the same 
sample) rather than the more relevant (and larger) between replicates 
variance for finding its 'c'.

	Best wishes
	Wolfgang


PS A pet peeve of mine, how come that people are OK to state numbers 
such as '16' without reference to units: ideally, 'official' SI units, 
but even an ad hoc definition would do, based on a common agreement on 
reagent quantities, scanner settings, software settings, to make the 
oxymoron 'arbitrary fluorescence units' just a little bit less so.



Il Apr/9/11 3:35 AM, Gordon K Smyth ha scritto:
> Hi Wolfgang,
>
>> Date: Thu, 07 Apr 2011 12:07:05 +0200
>> From: Wolfgang Huber <whuber at embl.de>
>> To: bioconductor at r-project.org
>> Subject: Re: [BioC] limma
>>
>> Hi Gordon
>>
>>> .... "limma ensures that all probes are
>>> assigned at least a minimum non-zero expression level on all arrays, in
>>> order to minimize the variability of log-intensities for lowly expressed
>>> probes. Probes that are expressed in one condition but not other will be
>>> assigned a large fold change for which the denominator is the minimum
>>> expression level. This approach has the advantage that genes can be
>>> ranked by fold change in a meaningful way, because genes with larger
>>> expression expression changes will always be assigned a larger fold
>>> change."
>
> This comment was in the context of genes expressed in one condition and
> not the other (and was part of a longer post). In this context the
> estimated fold change is essentially monotonic in the higher expression
> level, provided the zero value is offset away from zero, so larger
> expression changes do translate into larger fold changes. In other
> contexts, it is a question of importance ranking, which I guess is the
> issue that you're raising below.
>
>> I am not sure I follow:
>>
>> (i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)
>>
>> but
>>
>> (ii) 20 / 10 > 15000 / 10000
>>
>> You assume that measurements of 20 and 10 are less reliable (or perhaps
>> biologically less important?) than measurements of 20000 and 10000, thus
>> that ranking (i) should be used
>
> Generally I rank probes by a combination of statistical significance and
> fold change, not by fold change alone. However, the discussion is in the
> context of Illumina expression data, and Illumina intensities of 10 and
> 20 are almost certain to be from non-expressed probes, hence contain no
> biological signal. So, yes, I would generally view measurements of 20000
> and 10000 as both statistically more precise and biologically more
> important than 20 and 10, and I would therefore want to rank as (i)
> rather than (ii). I'm pretty sure that you would too.
>
>> - but that depends on an error model (which you encode in the
>> pseudocount parameter '16')
>
> I put more faith in experimental evidence than I do in statistical error
> models. The fact that offsetting the intensities away from zero reduces
> the FDR is an observation from considerable testing with calibration
> data sets. The evidence doesn't rely on an error model. Much of the
> evidence is laid out in the paper that I cited in my earlier email:
>
> Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus
> bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic
> Acids Research 38, e204.
>
>> and a subjective trade-off between precision and effect size.
>
> The fact that the value is chosen from experience with data, rather than
> as as a parameter estimated from a mathematical model, doesn't make it
> subjective. As I've said, I take mathematical models with a grain of salt.
>
> It's easy to verify experimentally that well known preprocessing
> algorithms, like RMA for Affy data or vst for Illumina data (you're an
> author!), also have the effect of offsetting intensities away from zero
> before logging them. I think it is a useful insight to observe that this
> offsetting is a good part of why those algorithms have good statistical
> properties. vst has an effective offset of around 200 (Wei et al, Tables
> 2 and 3). As far as I know, the offset was not designed into either of
> the above algorithms. I suspect it was rather a fortuitious but
> unexpected outcome. The offset that vst seems to have isn't a natural
> outcome of the variance stabilization model, because it generally turns
> out to be much larger than the offset that would best stabilize the
> variance. Anyway, we find that by using more modest offsets in the range
> 16-50 for Illumina data, we can achieve FDR as good as vst but with less
> bias, much less contraction of the fold changes. Again, this is a
> conclusion from testing rather than from modelling.
>
> I prefer to make the offset explicit, clearly visible to users, rather
> than leaving it implicit or unexpected. This approach (neqc etc) isn't
> the only good way to address noise, bias and variance stabilization
> issues, but it's the one that seems to work best for me at the moment.
>
> Cheers
> Gordon
>
>> I agree with you that the approach is useful, and also that it is good
>> to provide a very simple recipe for people that either cannot deal with
>> or do not care about the quantitative details. Still, this post is for
>> the people that do :)
>>
>> Cheers
>> Wolfgang
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:14}}



More information about the Bioconductor mailing list