[R] Using plyr::ddply more (memory) efficiently?
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu Apr 29 15:06:53 CEST 2010
Hi all,
In short:
I'm running ddply on an admittedly somewhat large data.frame (though not
that large). It runs fine until the end, when it gets to the
"collating" part where all the subsets of my data.frame have been
summarized and are being reassembled into the final summary
data.frame (sorry, I don't know the correct plyr terminology). During
collation, my R workspace RAM usage goes from about 1.5 GB up to 20 GB,
at which point I kill it.
Running a similar piece of code that iterates manually without ddply,
using a combo of lapply and a do.call(rbind, ...), uses considerably
less RAM (tops out at about 8 GB).
How can I use ddply more efficiently?
Longer:
Here's more info:
* The data.frame itself is ~15.8 MB when loaded.
* ~400,000 rows, 8 columns
It looks like so:
   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...
I'm "ply"-ing over the "transcript" column, and the function transforms
each such subset of the data.frame into a new data.frame with just
1 row per transcript, which basically holds the sum of the "counts"
for that transcript.
The code would look something like this (`summaries` is the data.frame
I'm referring to):
library(plyr)

rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol = df$symbol[1], counts = sum(df$counts))
})
(It actually calculates 2 more columns that are returned in the
data.frame, but I'm not sure that's really important here).
To test some things out, I've written another function to manually
iterate/create subsets of my data.frame to summarize.
I'm using sqldf to dump the data.frame into a db, then I lapply over
subsets of the db `where transcript=x` to summarize each subset of my
data into a list of single-row data.frames (like ddply is doing), and
finish with a `do.call(rbind, the.dfs)` on this list.
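Roughly, that manual version looks like the sketch below (the variable
names are illustrative, and in my real code the data.frame only gets
dumped into the db once rather than being re-queried through sqldf each
time, but the shape of it is the same):

library(sqldf)

## one summary pass per transcript: pull the matching rows, then collapse
## them to a single-row data.frame, just like the function given to ddply
transcripts <- unique(as.character(summaries$transcript))
the.dfs <- lapply(transcripts, function(x) {
  df <- sqldf(sprintf("select * from summaries where transcript = '%s'", x))
  data.frame(symbol = df$symbol[1], counts = sum(df$counts))
})

## reassemble the per-transcript rows into the final summary data.frame
rpkm.manual <- do.call(rbind, the.dfs)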
This returns the exact same result ddply would return, and by the time
`do.call` finishes, my RAM usage hits about 8 GB.
So, what am I doing wrong with ddply that makes the RAM usage in the
last step ("collation" -- the equivalent of my final
`do.call(rbind, my.dfs)`) end up more than 12 GB higher?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact