[R] error serialize (foreach)
Jon Skoien
jon.skoien at jrc.ec.europa.eu
Mon Dec 5 15:29:23 CET 2016
Parallel processing usually carries quite a lot of overhead, which is
expensive when the computation itself is quick. This is definitely an
example where the function is too simple to benefit from
parallelization. Your example also has some errors that make the effect
even stronger: you are only averaging over the first three elements of
the list.

I have modified the example below to call a more complicated function
than mean. The parallelized version is then faster (although not by
much). To see the difference, replace the lapply lines with
"lapply(dat, mean)". Below the foreach example you can also see the same
computation with clusterApply, which seems to be much more efficient for
this problem.

N <- 200000
myList <- vector('list', N)
for (i in 1:N) {
  myList[[i]] <- rnorm(100)
}
library(foreach)
library(doParallel)
ncores = 7
registerDoParallel(cores = ncores)
# label the elements X1..X7 in turn, so each label marks one worker's share
names(myList) = make.names(rep(1:ncores, length.out = N))
nms = 1:ncores
# serial version
system.time(result <- foreach(i = 1:ncores) %do% {
  dat <- myList[which(names(myList) == make.names(nms[i]))]
  lapply(dat, FUN = function(x) log(sd(x)) + sd(x) + var(x))
})
# parallel version
system.time(result2 <- foreach(i = 1:ncores) %dopar% {
  dat <- myList[which(names(myList) == make.names(nms[i]))]
  lapply(dat, FUN = function(x) log(sd(x)) + sd(x) + var(x))
})

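The subsetting inside the loops above could also be moved out of the workers. The following is only a sketch, which I have not timed: it assumes a smaller list (N = 2000) and 2 cores so that it runs quickly, and the names chunks, result3 and result_serial are made up for illustration. The list is split once up front, and foreach then iterates over the chunks themselves rather than over an index, so each task does not need to search the full list.

```r
library(foreach)
library(doParallel)

# Smaller, hypothetical data set so the sketch runs quickly
N <- 2000
ncores <- 2
myList <- replicate(N, rnorm(100), simplify = FALSE)
names(myList) <- make.names(rep(1:ncores, length.out = N))

registerDoParallel(cores = ncores)

# Split once, up front; each element of 'chunks' is one task's share
chunks <- split(myList, names(myList))

f2 <- function(x) log(sd(x)) + sd(x) + var(x)

# Iterate over the chunks themselves instead of over an index
result3 <- foreach(dat = chunks) %dopar% {
  lapply(dat, f2)
}

# Serial reference for comparison
result_serial <- lapply(chunks, function(dat) lapply(dat, f2))

# Release any workers that registerDoParallel started implicitly
stopImplicitCluster()
```

Whether this helps will depend on the platform and on how large the chunks are, so it is worth wrapping both versions in system.time() on the real data.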
foreach is not always the best choice for parallel processing. You could
also have a look at clusterApply:

f1 = function(x) mean(x)
f2 = function(x) log(sd(x)) + sd(x) + var(x)
cl = makeCluster(ncores)
# make f1 and f2 available on the workers
clusterExport(cl, c("f1", "f2"))
# one chunk of the list per worker
dats = split(myList, names(myList))
# simple function: parallel vs. serial
system.time(res <- clusterApply(cl, dats, fun = function(x) lapply(x, f1)))
system.time(res <- lapply(dats, FUN = function(x) lapply(x, f1)))
# more complicated function: parallel vs. serial
system.time(res <- clusterApply(cl, dats, fun = function(x) lapply(x, f2)))
system.time(res <- lapply(dats, FUN = function(x) lapply(x, f2)))
# shut down the workers when done
stopCluster(cl)

lapply is still faster for the example with mean, but much slower for
the more complicated function.
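
If you do not need control over how the list is divided, parLapply from the parallel package does the chunking for you: it splits the list into one piece per worker and runs lapply on each piece. A minimal sketch, again assuming a smaller list and 2 workers so that it runs quickly (res_par and res_ser are just illustrative names):

```r
library(parallel)

# Smaller, hypothetical data set so the sketch runs quickly
N <- 2000
myList <- replicate(N, rnorm(100), simplify = FALSE)
f2 <- function(x) log(sd(x)) + sd(x) + var(x)

cl <- makeCluster(2)
# parLapply splits myList into one chunk per worker and applies f2 to each element
res_par <- parLapply(cl, myList, f2)
stopCluster(cl)

# Serial reference for comparison
res_ser <- lapply(myList, f2)
```

Since f2 is passed directly as the function argument, no clusterExport call is needed here.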
Best,
Jon
On 12/4/2016 3:11 AM, Doran, Harold wrote:
> As a follow up to this, I have been able to generate a toy example of reproducible code that generates the same problem. Below is just a sample to represent the issue, but my data and subsequent functions acting on the data are much more involved.
>
> I no longer have the error, but the loop running in parallel is extremely slow relative to its serialized counterpart.
>
> I have narrowed down the problem to the fact that I am searching through a very large list, grabbing the data from that list by indexing to subset and then doing stuff to it. Both "work", but the parallel version is very, very slow. I believe I am sending data files to each core and the number of searches happening is prohibitive.
>
> I am very much stuck in the design I would use for this particular problem on a single core, and am not sure whether there is a better design for solving it in the parallel version.
>
> Any advice on better ways to work with the %dopar% version here?
>
> N <- 200000
> myList <- vector('list', N)
> names(myList) <- 1:N
> for(i in 1:N){
> myList[[i]] <- rnorm(100)
> }
> nms <- 1:N
> library(foreach)
> library(doParallel)
> registerDoParallel(cores=7)
>
> result <- foreach(i = 1:3) %do% {
> dat <- myList[[which(names(myList) == nms[i])]]
> mean(dat)
> }
>
> result <- foreach(i = 1:3) %dopar% {
> dat <- myList[[which(names(myList) == nms[i])]]
> mean(dat)
> }
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Saturday, December 03, 2016 4:26 PM
> To: r-help at r-project.org
> Subject: [R] error serialize (foreach)
>
> I have a portion of a foreach loop that I cannot run as parallel but works fine when serialized. Below is a representation of the problem as in this instance I cannot provide reproducible data to generate the same error, the actual data I am working with are confidential.
>
> Within each foreach loop are a series of custom functions acting on my data. When using %do% I get expected result but replacing it with %dopar% generates the error.
>
> I have searched archives and also stackexchange and see this is an issue that arises and I have tried a couple of the recommendations, like trying to use an outfile in makeCluster. But I am not having success.
>
> Oddly (or perhaps not oddly), other portions of my program run in parallel and do not generate this same error.
>
> library(foreach)
> library(doParallel)
> registerDoParallel(cores=3)
>
> # This portion runs and produces expected result
> result <- foreach(i = 1:N) %do% {
> tmp1 <- function1(...)
> tmp2 <- function2(...)
> tmp2
> }
>
> # This portion generates error in serialize
> result <- foreach(i = 1:N) %dopar% {
> tmp1 <- function1(...)
> tmp2 <- function2(...)
> tmp2
> }
>
> error in serialize(data, node$con) : error writing to connection
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jon Olav Skøien
Joint Research Centre - European Commission
Institute for Space, Security & Migration
Disaster Risk Management Unit
Via E. Fermi 2749, TP 122, I-21027 Ispra (VA), ITALY
jon.skoien at jrc.ec.europa.eu
Tel: +39 0332 789205
Disclaimer: Views expressed in this email are those of the individual
and do not necessarily represent official views of the European Commission.