[R] Snow and multi-processing

Martin Morgan mtmorgan at fhcrc.org
Sun Nov 30 03:28:25 CET 2008

Hi Marco --

Do you know about Bioconductor, http://bioconductor.org ? The
rowttests function in the genefilter package will do what you want
efficiently and on a single node.

> # install the package
> source('http://bioconductor.org/biocLite.R')
> biocLite('genefilter')
> # do 500k t-tests
> library(genefilter)
> m <- matrix(runif(500000*20), ncol=20)
> f <- factor(rep(c("A", "B"), each=10))
> system.time(rowttests(m, f))
   user  system elapsed 
  0.964   0.128   1.095 
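
For a sense of why one node is enough: all rows can be tested at once
with matrix operations instead of calling t.test() in a loop. Below is
a rough base-R sketch of that idea. I am not claiming this is how
rowttests is implemented; row_t is a hypothetical helper, and it
assumes the pooled-variance (equal variance) two-sample test.

```r
# Hypothetical helper: two-sample, equal-variance t-test for every row of m.
# f is a two-level factor assigning each column of m to a group.
row_t <- function(m, f) {
    f <- as.factor(f)
    stopifnot(nlevels(f) == 2, length(f) == ncol(m))
    a <- m[, f == levels(f)[1], drop = FALSE]
    b <- m[, f == levels(f)[2], drop = FALSE]
    na <- ncol(a)
    nb <- ncol(b)
    ma <- rowMeans(a)
    mb <- rowMeans(b)
    # per-row sums of squares; 'a - ma' recycles ma down each column
    ssa <- rowSums((a - ma)^2)
    ssb <- rowSums((b - mb)^2)
    df <- na + nb - 2
    pooledVar <- (ssa + ssb) / df
    tstat <- (ma - mb) / sqrt(pooledVar * (1 / na + 1 / nb))
    data.frame(statistic = tstat, p.value = 2 * pt(-abs(tstat), df))
}

set.seed(1)
m <- matrix(runif(1000 * 20), ncol = 20)
f <- factor(rep(c("A", "B"), each = 10))
res <- row_t(m, f)
```

Any single row can be checked against
t.test(m[i, f == "A"], m[i, f == "B"], var.equal = TRUE).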

A package like limma, with its great vignette, is an excellent
introduction to statistical analyses that make better use of this type
of data. See the links to Bioconductor packages at


A little more below...


"Blanchette, Marco" <MAB at stowers-institute.org> writes:

> Dear R gurus,
> I have an embarrassingly parallel job that I am trying to
> speed up with snow on our local cluster. Basically, I am doing
> ~50,000 t-tests for a series of microarray experiments, one gene at
> a time. Thus, I can easily spread the load across multiple
> processors and nodes.
> So, I have a master list object that tells me which rows to pick up
> for each gene to do the t-test, from a series of microarray
> experiments containing ~500,000 rows and x columns per experiment.
> While trying to optimize my function using parLapply(), I quickly
> realized that I was not gaining any speed because every time a test
> was done on one of the items in the list, the 500,000-row by
> x-column matrix had to be shipped along with the item, and
> the transfer time was actually longer than the computing time.
> However, if I export the 500,000-row object first across the spawned
> processes, as in this mock script
> cl <- makeCluster(nnodes,method)
> mArrayData <- getData(experiments)
> clusterExport(cl, 'mArrayData')
> Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

Try writing this in a more 'functional' style, so that all variables
used by the function in parLapply are passed to it as arguments,

parLapply(cl, theMapList, function(probeList, bigArray) {
    x <- bigArray[probeList$A,]
    y <- bigArray[probeList$B,]
    doSomeTest(x, y)
}, bigArray=mArrayData)

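snow will see to distributing bigArray in an appropriate way:
parLapply splits theMapList into one chunk per node, so the array
travels to each node once rather than once per list element.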
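
Here is a small self-contained sketch of that pattern, in case it
helps. It uses the parallel package that ships with current R, whose
makeCluster() and parLapply() follow snow's interface. The names
testOne and runAll, the random data and the two-worker cluster are all
made up for illustration. The point is that the array is a local
variable inside runAll, so nothing depends on the master's global
environment.

```r
library(parallel)

# Worker function defined at top level: everything it needs arrives
# as an argument, so it never looks anything up in a global variable.
testOne <- function(probeList, bigArray) {
    x <- as.vector(bigArray[probeList$A, ])
    y <- as.vector(bigArray[probeList$B, ])
    t.test(x, y)$p.value
}

# Hypothetical driver: the big matrix exists only inside this function.
runAll <- function(cl, mapList, nrows = 1000) {
    localArray <- matrix(rnorm(nrows * 4), ncol = 4)
    # extra arguments after the function are forwarded to every call
    parLapply(cl, mapList, testOne, bigArray = localArray)
}

cl <- makeCluster(2)
mapList <- lapply(1:10, function(i) {
    list(A = c(i, i + 1), B = c(i + 20, i + 21))
})
pvals <- runAll(cl, mapList)
stopCluster(cl)
```

Keeping testOne at top level is deliberate. A function created inside
runAll would carry runAll's environment with it when serialized, which
would ship the array a second time.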

> With a function that defines the mArrayData argument as a default parameter, as in
> t.testFnc <- function(probeList, array=mArrayData){
>     x <- array[probeList$A,]
>     y <- array[probeList$B,]
>     res <- doSomeTest(x,y)
>     return(res)
> }
> Using this strategy, I was able to take full advantage of my cluster
> and reduce the analysis time by a factor equal to the number of
> nodes in our cluster. The large data matrix was resident in each
> process and didn't have to travel over the network every time an
> item from the list was passed to the function t.testFnc().
> However, I quickly realized that this (the call to
> clusterExport()) works only when I run the script one line at a
> time. When the process is enclosed in a function, the object
> mArrayData is not exported, presumably because it's not a global
> object in the master process.
> So, what is the alternative for pushing the content of an object to
> the slaves? The documentation in the snow package is a bit light and
> I couldn't find a good example out there. I don't want to have the
> function getData() evaluated on each node because the arguments to
> that function are humongous and that would cause way too much
> traffic on the network. I want the result of the function getData(),
> the object mArrayData, propagated to the cluster only once and
> available to downstream functions.
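
If you do want to keep the export approach inside a function, one
possibility is sketched below. It relies on the envir argument of
clusterExport() in the parallel package that ships with current R; I
am not assuming the snow version you have installed accepts it. The
names rowSumOf and analyse and the tiny matrix are stand-ins for
illustration only.

```r
library(parallel)

# Top-level worker function; it finds mArrayData in the worker's own
# global environment, where clusterExport() places it.
rowSumOf <- function(i) sum(mArrayData[i, ])

# Hypothetical driver: mArrayData is local to this function.
analyse <- function(cl) {
    mArrayData <- matrix(seq_len(20), ncol = 4)  # stand-in for getData()
    # export from this function's frame, not from the global environment
    clusterExport(cl, "mArrayData", envir = environment())
    parLapply(cl, 1:5, rowSumOf)
}

cl <- makeCluster(2)
sums <- analyse(cl)
stopCluster(cl)
```

The object is sent to each worker once and stays there for any later
calls on the same cluster.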

> Hope this is clear and that a solution will be possible.
> Many thanks
> Marco
> --
> Marco Blanchette, Ph.D.
> Assistant Investigator
> Stowers Institute for Medical Research
> 1000 East 50th St.
> Kansas City, MO 64110
> Tel: 816-926-4071
> Cell: 816-726-8419
> Fax: 816-926-2018

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
