[BioC] bioconductor on EMR / mapreduce
Dan Tenenbaum
dtenenba at fhcrc.org
Wed Sep 26 02:48:36 CEST 2012
On Mon, Sep 24, 2012 at 11:28 PM, seth redmond <seth.redmond at pasteur.fr> wrote:
> I'm not too worried about set-up times; once I'd bootstrapped libraries into
> place I could control the proportion of setup time per node by increasing
> the granularity - besides, with the volume of data I'm looking at I don't
> expect it to be a major issue, but having to switch between MPI and hadoop
> for my clusters will be.
>
> This particular case seems to me to be a simple package dependency (though
> I'm not sure recompiling R on the EMR image is something I want to get
> into), however it's not likely to be the last one I run into. So I'm
> wondering how complex it would be to, for instance, compile an R library on
> the machine image and then transfer it into place for each run? I guess this
> would be a function of how many dependencies bioC has outside of the packages?
> (for running, that is, not compiling) - obviously samtools, and similar, but
> I'm thinking more of library dependencies that will be harder to debug.
Can't you accomplish these things with bootstrap scripts?
Here is an example of bootstrapping recent R into EMR:
http://www.r-bloggers.com/bootstrapping-the-latest-r-into-amazon-elastic-map-reduce/
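
For the DNAcopy failure quoted further down, a bootstrap action along these
lines might be enough. This is only a sketch under assumptions, not a tested
recipe: it assumes a Debian-based EMR image with apt-get, that the missing
-lgfortran is supplied by the gfortran package, and that biocLite works with
the R version on the image. The run helper and the DRY_RUN variable are made
up for illustration; by default the script only prints what it would do.

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action (a sketch, not a tested recipe).
# Assumptions: Debian-based image with apt-get; the "gfortran" package
# provides the library the linker could not find; Rscript is on the PATH.
set -euo pipefail

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, else execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# "cannot find -lgfortran" suggests the gfortran development files are
# absent, so install the compiler toolchain before building packages.
run sudo apt-get update
run sudo apt-get install -y gfortran

# Install the Bioconductor package from source via biocLite.
run Rscript -e 'source("http://bioconductor.org/biocLite.R"); biocLite("DNAcopy")'
```

Setting DRY_RUN=0 in the environment makes the script execute the commands
rather than print them.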
>
> I take it the EMR portion of the bioC-in-the-cloud project has been dropped?
>
There never really was an EMR portion.
We are interested in hearing about compelling use cases, though.
Dan
> -s
>
>
> --
> Seth Redmond
> Unité Génétique et Génomique des Insectes Vecteurs
> Institut Pasteur
> 28,rue du Dr Roux
> 75724 PARIS
> seth.redmond at pasteur.fr
>
> On 24 Sep 2012, at 19:50, Dan Tenenbaum wrote:
>
> On Mon, Sep 24, 2012 at 9:42 AM, seth redmond <seth.redmond at pasteur.fr> wrote:
>
> I'm trying to install some bioC modules on EC2 / Elastic Mapreduce but I'm
> running into some library errors when installing (error below). Whilst I
> could install them locally on each machine, if possible I'd rather avoid the
> overhead both in terms of bootstrapping the machines, and having to check
> for library errors whenever I write a new method.
>
>
> Has anyone run bioC in the cloud in this manner and tried, for instance,
> building a library in an S3 bucket and running directly from there, or
> porting the R lib wholesale when starting up the nodes? Or is it possible
> to use the BioC AWS image in EMR somehow?
>
>
>
> From what I have been able to tell, AWS EMR is not very usable with R.
> It takes longer to load packages on each mapper/reducer than it does
> to run the calculation I am trying to parallelize.
>
> I've looked at other strategies like RHIPE, or good old MPI.
> Dan
>
> thanks
>
> -s
>
> * Installing *source* package 'DNAcopy' ...
> ** libs
> gfortran -fpic -g -O2 -c changepoints.f -o changepoints.o
> gcc -std=gnu99 -I/usr/share/R/include -fpic -g -O2 -c flchoose.c -o flchoose.o
> gcc -std=gnu99 -I/usr/share/R/include -fpic -g -O2 -c fphyper.c -o fphyper.o
> gcc -std=gnu99 -I/usr/share/R/include -fpic -g -O2 -c fpnorm.c -o fpnorm.o
> gfortran -fpic -g -O2 -c getbdry.f -o getbdry.o
> gfortran -fpic -g -O2 -c hybcpt.f -o hybcpt.o
> gfortran -fpic -g -O2 -c prune.f -o prune.o
> gcc -std=gnu99 -I/usr/share/R/include -fpic -g -O2 -c rshared.c -o rshared.o
> gfortran -fpic -g -O2 -c segmentp.f -o segmentp.o
> gcc -std=gnu99 -shared -o DNAcopy.so changepoints.o flchoose.o fphyper.o fpnorm.o getbdry.o hybcpt.o prune.o rshared.o segmentp.o -lgfortran -lm -L/usr/lib64/R/lib -lR
> /usr/bin/ld: cannot find -lgfortran
> collect2: ld returned 1 exit status
> make: *** [DNAcopy.so] Error 1
> ERROR: compilation failed for package 'DNAcopy'
> ** Removing '/home/hadoop/R/x86_64-pc-linux-gnu-library/2.7/DNAcopy'
>
> The downloaded packages are in
> /tmp/RtmpxSeilp/downloaded_packages