[BioC] Limma: Normalization with large numbers of differentially expressed genes

Wed Oct 10 11:58:26 CEST 2007

Quoting Serge Eifes <serge.eifes at lbmcc.lu>:

>
> Dear all,
>
> We have performed a time-series experiment (2h, 6h, 10h, 48h, 72h) on
> dual-channel arrays where we want to compare gene expression between treated
> and time-matched untreated cells.
>
> This experiment was done using  Agilent 4112F human whole genome microarrays
> (with 45k features). Statistical analysis is performed using LIMMA 2.10.7 on
> R 2.5.1.
> Background correction was performed using normexp with an offset of 100.
> Loess normalization was done using a span of 0.4 and 12 iterations.
>
> Now I have encountered the following problems during data analysis:
>
> 1) The microarrays for the whole experiment were scanned at quite low
> intensities. This means that about 22k features on average per array have an
> A-value located between 7 and 8.
>
> 2) There also seem to be quite large numbers of differentially
> expressed probes when considering the raw per-probe p-values from the
> moderated t-test for the different time-points and the p-values for the
> moderated F-statistic after MHC (FDR, BH).
>
> Numbers of significant probes with raw per-probe p-value < 0.05 from
> moderated t-test as retrieved from the "MArrayLM" object are shown here:
> * t=0h: 1419
> * t=2h: 9428
> * t=6h: 15013
> * t=10h: 13641
> * t=48h: 21713
> * t=72h: 18027
>
> Here are the numbers of significant probes I get using the moderated
> F-statistic (nestedF) with p<0.05 after MHC:
> * t=0: 515
> * t=2h: 6278
> * t=6h: 11460
> * t=10h: 10560
> * t=48h: 17250
> * t=72h: 14311
>
> Now I've got the following questions:
>
> * Is the accumulation of signals at such low average intensities problematic
> for the normalization process (beside that it may introduce a higher
> variability into the measurements)?
>
> * I already read in a reply by G.K. Smyth ([BioC] limma Normalization
> question) that loess normalization might get problematic when having around
> 20% of differentially expressed genes.  So in this case, does Loess
> normalization still work correctly, considering such large numbers of
> differentially expressed genes? If not, what kind of normalization may be
> more appropriate for this kind of data.
>
> Thanks in advance!
>
> Best Regards,
> Serge Eifes

Hi Serge,

Having a lot of low-intensity spots would only add noise; it should not
create much of a problem for normalisation. You used the normexp method for
background correction, which, with an appropriate offset, can be very good
at making the M values of low-intensity spots converge nicely towards zero,
so I wouldn't worry excessively about that.

Having a large percentage of differentially expressed genes is more of a
problem. The figure of 20% sounds like a conservative estimate, but it
really depends on how those 20% of spots are distributed, and you may get
away with more. Loess simply fits a curve to the population, on the
assumption that this curve represents the non-changing baseline along which
spots with no differential expression should align. This of course assumes
that most of the data are distributed more or less evenly on both sides of
the curve. These assumptions are generally okay, and even some deviations
are tolerated, but you have to look at each experiment and decide.

What do the MA plots look like? From MA plots you can see the distribution
of M values. To see it before normalisation, make an MA object using
normalisation between arrays with method="none". You can then compare those
plots with the MA plots after normalisation, to see the effect the
normalisation procedure has on the whole distribution.
You might find that, when there are too many differentially expressed genes,
loess distorts the distribution in ways that do not seem reasonable. How
many is too many? It depends, not only on the number but also on their
distribution across intensities. MA plots are the best way to check this
sort of thing.
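
For reference, M and A are just the log-ratio and the average log-intensity
of the two channels. A minimal sketch in Python (not limma code; the function
name `ma_values` and the example intensities are made up for illustration),
assuming background-corrected, strictly positive intensities:

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot.

    M is the log2 ratio of the two channels, A is their mean log2 intensity.
    Both intensities must be strictly positive (e.g. after normexp + offset).
    """
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


# Hypothetical spot: red channel twice as bright as green.
m, a = ma_values(256.0, 128.0)
print(m, a)
```

An A-value between 7 and 8 therefore corresponds to a geometric mean
intensity of the two channels between 128 and 256.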

I had an experiment in which a large number of genes were activated (going
from low or no expression to a decent level). The MA plot looked something
like this (combining several slides, after lmFit):

http://mcnach.com/MISC/MAplots2.png

With loess normalisation, my activated spots contributed excessively to the
total population, especially in the range A=11 to A=12.5 or so. The loess
curve was clearly pushed up in that area, and the normalised data were
distorted as a result, being pushed down.
For this sort of case, the best approach is to have a set of known invariant
spots, or control spots whose behaviour is expected, and use those to
normalise the whole thing. But often we don't have those.
In the case above, I was able to identify reasonably easily a large number
of the genes that were being activated, and I could flag them so that they
would not be included in the normalisation. By removing a reasonable
proportion of them I was able to eliminate the distortion, and the final
plots look reasonable to me. I took a lot of time to verify genes and make
sure that everything was behaving alright, so I was happy with this method.
However, it requires that you are familiar with the biology of the
experiment, and that you check and recheck that what you're doing doesn't
cause harm.
On the positive side, when I compared the results from loess applied
directly to all spots (despite the distortion) with those from my more
carefully chosen ones, I found that whilst the latter were better in
general, I could still pick out pretty much the same genes either way.
Perhaps I was looking for a population that was already distinct enough...
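
To illustrate the effect described above, here is a minimal simulation in
Python. It is a sketch under loud assumptions, not what limma does: a running
median stands in for the loess fit, the data are simulated, all names
(`running_median_baseline`, `activated`, and so on) are made up, and every
activated spot is assumed to have been flagged correctly. About 70% of the
spots between A=11 and A=12.5 are shifted up by M=2, which is only around
13% of all spots but is concentrated in one intensity range.

```python
import random
import statistics


def running_median_baseline(a_vals, m_vals, use, half_window=0.5):
    """Baseline M at each spot's A: the median M of usable spots whose A
    lies within +/- half_window. A crude stand-in for a loess curve."""
    baseline = []
    for a in a_vals:
        nearby = [m for ai, m, u in zip(a_vals, m_vals, use)
                  if u and abs(ai - a) <= half_window]
        baseline.append(statistics.median(nearby) if nearby else 0.0)
    return baseline


# Simulated data (assumption): true baseline is M = 0 with Gaussian noise.
random.seed(1)
n = 1500
a_vals = [random.uniform(6.0, 14.0) for _ in range(n)]
m_vals = [random.gauss(0.0, 0.2) for _ in range(n)]

# Activate roughly 70% of the spots with A between 11 and 12.5.
activated = [False] * n
for i, a in enumerate(a_vals):
    if 11.0 <= a <= 12.5 and random.random() < 0.7:
        m_vals[i] += 2.0
        activated[i] = True


def mean_unchanged_m(baseline):
    """Mean normalised M of the non-activated spots in the affected range."""
    vals = [m - b
            for a, m, b, act in zip(a_vals, m_vals, baseline, activated)
            if not act and 11.6 <= a <= 11.9]
    return statistics.mean(vals)


# Fit the baseline on all spots, then on the unflagged spots only.
# In both cases the baseline is subtracted from every spot.
fit_all = running_median_baseline(a_vals, m_vals, [True] * n)
fit_clean = running_median_baseline(a_vals, m_vals,
                                    [not act for act in activated])

biased = mean_unchanged_m(fit_all)
corrected = mean_unchanged_m(fit_clean)
print(f"unchanged spots, fit on all spots:      mean M = {biased:.2f}")
print(f"unchanged spots, fit on unflagged only: mean M = {corrected:.2f}")
```

When the curve is fitted on all spots, the unchanged spots in the affected
range are pushed well below zero; when the flagged spots are left out of the
fit, they stay close to zero. In limma the analogous step would presumably be
to give the flagged spots zero weight through the `weights` argument of
normalizeWithinArrays, so that they are still normalised but do not influence
the fit.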

I'm not sure this is of any help to you right now. I guess the bottom line
is: make plots before and after normalisation, have a good idea of what you
are expecting, and see how far that is from what you get. Loess just fits a
curve to the distribution according to certain parameters. If you think you
know what the curve should look like (representing the non-changing bulk of
the data), you can often find a work-around, as long as you know to some
degree what is expected in your experiment. Without proper control spots,
one has to be careful, and understand the experiment.

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
