[R] Using plyr::ddply more (memory) efficiently?
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu Apr 29 15:06:53 CEST 2010
Hi all,
In short:
I'm running ddply on an admittedly somewhat large data.frame (though not
that large). It runs fine until the end, when it gets to the
"collating" part where all the subsets of my data.frame have been
summarized and are being reassembled into the final summary
data.frame (sorry, I don't know the correct plyr terminology). During
collation, my R workspace RAM usage goes from about 1.5 GB up to 20 GB,
at which point I kill it.
Running a similar piece of code that iterates manually without ddply,
using a combo of lapply and a do.call(rbind, ...), uses considerably
less RAM (tops out at about 8 GB).
How can I use ddply more efficiently?
Longer:
Here's more info:
* The data.frame itself is ~15.8 MB when loaded.
* ~400,000 rows, 8 columns
It looks like so:
   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...
I'm "ply"-ing over the "transcript" column, and the function transforms
each such subset of the data.frame into a new data.frame with just
1 row per transcript, which basically holds the sum of the "counts"
for that transcript.
The code would look something like this (`summaries` is the data.frame
I'm referring to):
library(plyr)

rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol = df$symbol[1], counts = sum(df$counts))
})
(It actually calculates 2 more columns that are returned in the
data.frame, but I'm not sure that's really important here).
To test some things out, I've written another function to manually
iterate/create subsets of my data.frame to summarize.
I'm using sqldf to dump the data.frame into a db, then I lapply over
subsets of the db `where transcript=x` to summarize each subset of my
data into a list of single-row data.frames (like ddply is doing), and
finish with a `do.call(rbind, the.dfs)` on this list.
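Roughly, that manual version looks like the sketch below (the variable
names are illustrative, and in my real code the data.frame only gets
dumped into the db once rather than being re-queried through sqldf each
time, but the shape of it is the same):

library(sqldf)

## one summary pass per transcript: pull the matching rows, then collapse
## them to a single-row data.frame, just like the function given to ddply
transcripts <- unique(as.character(summaries$transcript))
the.dfs <- lapply(transcripts, function(x) {
  df <- sqldf(sprintf("select * from summaries where transcript = '%s'", x))
  data.frame(symbol = df$symbol[1], counts = sum(df$counts))
})

## reassemble the per-transcript rows into the final summary data.frame
rpkm.manual <- do.call(rbind, the.dfs)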
This returns the exact same result ddply would return, and by the time
`do.call` finishes, my RAM usage hits about 8 GB.
So, what am I doing wrong with ddply that makes the RAM usage in the
last step ("collation" -- the equivalent of my final
`do.call(rbind, my.dfs)`) end up more than 12 GB higher?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact