[BioC] vsn in BioConductor 1.2

w.huber at dkfz-heidelberg.de
Mon Jul 21 22:22:03 MEST 2003


Hi Charles,

> 1) I would seriously consider reducing the amount of data you feed this
> program. It took 4.5 hours to process a 15,552 x 38 matrix on a 1.2 GHz
> Pentium III. There is a reason why the function vsnh exists. Unless you have
> some serious GHz, you probably want to run vsn on a random sample of genes
> or on one array at a time.

The program is indeed quite slow. The run time is about
    t = c * (no. of rows) * (no. of columns)
and according to your numbers (4.5 hours for 15,552 x 38 matrix elements)
c is about 27 ms on your machine. There is a lot of number crunching in
vsn. Together with Dennis Kostka I have an experimental version written
in C, but even that is "only" faster by a factor of 2-3.
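
As a rough worked example, the linear model above can be evaluated with the timing quoted in the question. This is only a sketch: the constant is fitted from a single measurement on one machine, the subset size of 3000 rows is a hypothetical choice, and the function names are made up for illustration.

```python
# Linear run-time model from the message: t = c * rows * columns.
# c is fitted from the one timing quoted above (4.5 h, 15,552 x 38 matrix,
# 1.2 GHz Pentium III), so treat every prediction as a rough guide only.

def fit_constant(seconds, n_rows, n_cols):
    """Return c in seconds per matrix element."""
    return seconds / (n_rows * n_cols)

def predict_seconds(c, n_rows, n_cols):
    """Predicted run time in seconds for an n_rows x n_cols matrix."""
    return c * n_rows * n_cols

c = fit_constant(4.5 * 3600, 15552, 38)
print(f"c = {c * 1000:.1f} ms per matrix element")

# Hypothetical subset size (an assumption, not a recommendation from vsn).
subset_rows = 3000
minutes = predict_seconds(c, subset_rows, 38) / 60
print(f"{subset_rows} rows x 38 columns: about {minutes:.0f} minutes")

# Under the same model, 38 single-array runs add up to the full run time.
per_array_total = sum(predict_seconds(c, 15552, 1) for _ in range(38))
print(f"one array at a time: {per_array_total / 3600:.1f} hours in total")
```

Because the model is linear in both dimensions, only reducing the number of rows or columns actually fed to the fit shortens the total time.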

A good strategy is indeed to run the program on a random sample of genes
(rows), and then use vsnh to apply the transformation to the whole data
matrix. See normalize.AffyBatch.vsn for an example. A random subset of a
few thousand spots should usually do. Splitting up the task by arrays
(e.g. one array at a time) will not help, since the run time is
proportional to the number of columns and the net run time will be the
same.
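
The "fit on a subset, apply to everything" pattern can be sketched as follows. This is not vsn and does not call it: the estimator is a simple stand-in (median and median absolute deviation per column), the data are synthetic, and all function names are hypothetical. The transformation is taken to be an arsinh of an affine function of the intensity, which is my assumption about the general form of the transformation.

```python
import math
import random

def fit_on_rows(rows):
    """Stand-in estimator (NOT vsn's likelihood fit): one (offset, scale)
    pair per column, from the median and the median absolute deviation."""
    params = []
    for col in zip(*rows):
        values = sorted(col)
        med = values[len(values) // 2]
        mad = sorted(abs(v - med) for v in values)[len(values) // 2]
        params.append((med, mad or 1.0))  # guard against a zero scale
    return params

def apply_to_all(matrix, params):
    """Apply the per-column transformation to every row of the matrix."""
    return [[math.asinh((x - off) / scale)
             for x, (off, scale) in zip(row, params)]
            for row in matrix]

random.seed(0)
# Synthetic intensities: 10000 "genes" x 4 "arrays" (made-up data).
matrix = [[random.lognormvariate(6, 1) for _ in range(4)]
          for _ in range(10000)]

subset = random.sample(matrix, 2000)        # random sample of rows
params = fit_on_rows(subset)                # costly step sees the subset only
transformed = apply_to_all(matrix, params)  # cheap step covers all rows
```

The point of the split is that the expensive parameter estimation touches only the sampled rows, while applying the fitted transformation is a single cheap pass over the full matrix.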

> 2) Assuming that you are using data from a two channel microarray, I
> strongly suspect that the red and green channels need to be side by side in
> your matrix. I think the point is to quantify measurement variation without
> contamination from any unnecessary source. I don't see any other way that
> pair information is being passed to vsn.

If you pass a data matrix with 2*k columns from k red/green slides, with
the two colors side by side in one matrix, vsn does not care about the
ordering of the columns, so it does not make a difference whether the
columns are ordered R1, G1, R2, G2, ..., Rk, Gk or R1, ..., Rk, G1, ...,
Gk. If someone is not comfortable with this, they can also call vsn in
turn for each array separately. Empirically, I've found that this makes
hardly any difference. (The parameter estimation is not affected by the
different correlations within and between arrays.)
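
For anyone who still prefers one particular layout, converting between the two column orderings is just an index permutation. A minimal sketch, with hypothetical helper names and column labels standing in for real data:

```python
def interleaved_to_grouped(k):
    """Column indices that turn R1,G1,...,Rk,Gk into R1..Rk,G1..Gk."""
    return [2 * i for i in range(k)] + [2 * i + 1 for i in range(k)]

def reorder_columns(matrix, order):
    """Return a copy of the matrix with its columns in the given order."""
    return [[row[j] for j in order] for row in matrix]

# One row of labels instead of intensities, to make the effect visible.
labels = [["R1", "G1", "R2", "G2", "R3", "G3"]]
order = interleaved_to_grouped(3)
print(reorder_columns(labels, order)[0])
# prints ['R1', 'R2', 'R3', 'G1', 'G2', 'G3']
```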

However, there should not be pronounced batch effects (e.g. arrays 1...50
looking technically very different from arrays 51...100).


> 3) I think that your problem and my old problem are likely to be quite
> different. I fed the program data in a format it didn't understand and you
> probably fed the program more data than it could process in a reasonable
> amount of time. (Since the program doesn't use "much" memory, you wouldn't
> have heard the hard drive running even if the program was still running.)

Yes. The error message about infinite likelihood has nothing to do with
the program's long, but finite, CPU time consumption.

> 4) I am pleased with the results I'm now getting from vsn. ...

That's always nice to hear :)

Best regards
  Wolfgang


