[BioC] Is a subset of my arrays from degraded RNA?

Thu Jul 24 17:01:20 CEST 2014

Dear List,

Although I do realise that my question has more to do with actual data
interpretation that coding using BioC packages, I'm hoping for some
input from other users with experience in microarray data analysis.

I order to support my explanation below, I have made a pdf with
diagnostic plots. I will refer to specific slides as I go along. The
presentation can be downloaded here: https://db.tt/jBqPNxIN

At the moment I'm analysing some microarray data as part of a
collaboration. Unfortunately, I have very little knowledge about the
actual generation/processing of these samples which could help address
my question.

By doing a boxplot on the raw Affymetrix chip data (from the U133plus2
platform), I noticed 2 'batches' based on differences in signal
intensities. Hierarchical clustering using all probesets on the array
supports this devision (Page 1 and 2). Noteworthy, this separation
into batches (i.e. a high and a low intensity batch) can partially be
traced back to the ScanDate of the arrays. That is, the ~100 samples
were scanned over three consecutive days; all samples scanned on the
first day belong to the high intensity batch whereas all samples
scanned on day 3 belong to the low intensity batch. Noteworthy, around
half of the samples scanned on day 2 fall into the high and low
intensity batch, respectively.

When I do a RLE plot (Page 3 - top), the median value for most of the
samples from the low intensity batch is between 0.1 and 0.2 (and not
zero as expected). Further, whereas ~40% of the probesets are called
"present" in the high intensity batch using the simpleaffy package,
only around ~30-35% are called present in the low intensity batch
(Page 3 - bottom).

Now, when I do boxplots specifically for the AFFX control probesets, I
discovered that the intensity is in fact higher in all low intensity
samples (Page 4).
Furthermore, when I focus on the Affy hybridization controls (i.e.
bioB, bioC, BioD and creX) the line plot looks good and the signal
intensity is comparable between samples in the two batches (Page 5,
left side). If I instead plot the poly-A controls I again see a
significant difference in intensity between batches (with
low-intensity samples having a higher signal). In addition, the signal
values consistently follow the order Phe<Lys<Thr<Dap (Page 5, right
side). NB: I'm a bit unsure as to the importance of the latter
observation.

The QC plots presented above suggest to me that the RNA from the low
intensity samples could potentially suffer from a RNA degradation
issue. However, both the 3'/5' ratios for beta-actin and GAPDH as well
as RNA degradation plots using affyPLM do not support my assumption
regarding degraded RNA (Pages 6 and 7). In fact, the ratios for GAPDH
indicates a higher signal intensity in the 5-prime end, which I find a
bit odd.
However, when I instead take advantage of the recent
AffyRNADegradation package I do get a small yet significant difference
between batches in terms of the computed decay value (aka parameter d)
(Page 8).

I then tried to normalize the two batches of samples independently
(using RMA). This allowed me to compare the mean signal intensity for
each probeset across the chip as the biological samples are indeed
comparable between batches. A scatterplot (Page 9) clearly
demonstrates that many probesets lie close to the diagonal line
despite the overall difference in intensity described on the first
page.
Further, by correlating the expression of specific probesets to an
established physiological variable it is apparent that the slight drop
in signal intensity do not affect the strong association to the
physiological variable (Page 10). If I instead focus on another
representative probeset--which is farther from the diagonal line--the
correlation to the same physiological readout is clearly weaker in the
low intensity batch (Page 11).

Could it be that only a smaller subset of the transcripts are
significantly affected by RNA degradation in the low intensity
samples? and how could I potentially demonstrate this?

In relation to the question: When I do MA plots against a "pseudo
reference chip" representing the probeset-wise medians across all ~100
RMA normalized samples, it also becomes apparent that a fraction of
the probesets for most of the low intensity samples lie far below the
M=0 line (see Page 12 for a representative example). However, to my
surprise only a very small fraction of probesets are consistently
below M=-1.5. In other words, different low-intensity samples have
different "outlying" probesets compared to the overall median.

To summarize, I have now put forward various QC plots that show that
the low intensity samples are overall are different. As I'm unsure
which way forward is the best (NB: my aim to the do a standard DEG
analysis), I would appreciate any thoughts or comments from members of
this list.

Kind regards,
Peter