[R] Organizing Large Datasets
Andrei Dubovik
andrei.dubovik at gmail.com
Thu Feb 2 17:28:01 CET 2012
Recently I ran into memory problems while using data.frames for a
reasonably large dataset. I solved those problems by switching to
arrays, and that prompted me to run a few benchmarks. I would like to
share the results.
Let us start with the data. There are N subjects classified into G
groups. These subjects are observed for T periods, and each
observation consists of M variables. So, this is a standard panel.
Suppose, though, that it's reasonably large, with hundreds of
variables, tens of thousands of subjects, and observations spanning
over a decade. As I see it, there are three common ways to organize
such data. The first
way is a single table, where each row is an observation (columns are
Group, Subject, Period, plus all M variables). This is the standard
layout in econometrics software; let me call it the wide format. The second
way is to have a separate table for data, where each row is an
observation for a particular variable, i.e. the columns are Subject,
Period, Variable, Value, and to have a separate table with
classification of subjects into groups. This would be a standard way
to organize data in a relational database (a star schema). Finally,
given that I'm talking about dense data, the data can be organized as
a multidimensional array (subjects, periods, variables), plus one
would need vectors with names for the elements of each of the
dimensions.
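
To make the three layouts concrete, here is a minimal sketch in base R
on a toy panel. The sizes, the names (s1, v1, g1, ...) and the
round-robin group assignment are made up for illustration; this is not
the benchmark code itself.

```r
set.seed(1)

# Toy sizes, for illustration only. Tp plays the role of T in the text
# (named Tp to avoid masking R's T shorthand for TRUE).
N <- 6; G <- 2; Tp <- 3; M <- 2
subjects  <- paste0("s", 1:N)
periods   <- as.character(2001:2003)
variables <- paste0("v", 1:M)
group     <- rep(paste0("g", 1:G), length.out = N)  # group of each subject

# Array format: subjects x periods x variables, names kept in dimnames
arr <- array(rnorm(N * Tp * M), dim = c(N, Tp, M),
             dimnames = list(subject = subjects,
                             period = periods,
                             variable = variables))

# Long format: one row per (subject, period, variable), plus a separate
# table classifying subjects into groups
long <- as.data.frame(as.table(arr), responseName = "value",
                      stringsAsFactors = FALSE)
groups <- data.frame(subject = subjects, group = group,
                     stringsAsFactors = FALSE)

# Wide format: one row per (subject, period), one column per variable.
# expand.grid() varies its first argument fastest, which matches the
# column-major order of arr[, , v].
wide <- expand.grid(subject = subjects, period = periods,
                    stringsAsFactors = FALSE)
wide$group <- group[match(wide$subject, subjects)]
for (v in variables) wide[[v]] <- as.vector(arr[, , v])
```

All three objects hold the same N * Tp * M values; only the bookkeeping
around them differs, and that is where the memory overhead comes from.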
I did two benchmarks: 1) creating random data in the respective
format, and 2) aggregating over groups. As data.table can be faster
than data.frame, I've included both. Here is the source code:
https://docs.google.com/uc?id=0B-uoYmSQJJvwNTdjNzljZjUtZmVhYS00ZTQ5LTgyMjEtYmJhMjg1OTBhOTU5
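
For readers who do not want to open the link, the second benchmark
(aggregating over groups) can be sketched roughly as follows. This is
an illustration in base R only, not the linked code: the toy sizes, the
choice of sum as the aggregate, and the use of rowsum() and aggregate()
are my assumptions.

```r
set.seed(1)

# Toy panel (sizes and names are hypothetical)
N <- 6; Tp <- 3; M <- 2
subjects  <- paste0("s", 1:N)
periods   <- as.character(2001:2003)
variables <- paste0("v", 1:M)
group     <- rep(c("g1", "g2"), length.out = N)  # group of each subject

arr <- array(rnorm(N * Tp * M), dim = c(N, Tp, M),
             dimnames = list(subjects, periods, variables))

# Array format: flatten (period, variable) into columns, sum the rows
# by group with rowsum(), then restore the three-dimensional shape.
flat <- matrix(arr, nrow = N)        # N x (Tp * M), order preserved
sums <- rowsum(flat, group)          # one row per group, sorted by name
agg_arr <- array(sums, dim = c(nrow(sums), Tp, M),
                 dimnames = list(rownames(sums), periods, variables))

# Wide format: the same aggregation with aggregate() on a data.frame
wide <- expand.grid(subject = subjects, period = periods,
                    stringsAsFactors = FALSE)
wide$group <- group[match(wide$subject, subjects)]
for (v in variables) wide[[v]] <- as.vector(arr[, , v])
agg_wide <- aggregate(wide[variables],
                      by = wide[c("group", "period")], FUN = sum)
```

Wrapping either aggregation in system.time() on realistically sized
data gives a rough timing comparison.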
The results, in brief, are as follows. The long format (star schema) is
dominated by all other options w.r.t. time and memory usage (no big
surprise, R is not MySQL). Concerning the wide format, data.table is
faster and more memory efficient than data.frame. Finally, the wide
format with a data.table and the array format are similar in execution
times, but the array format requires less memory. More importantly, if
I need to aggregate over variables, then the wide format is no longer
that suitable, whereas the array can be handled just as before. So, a
data.cube package, anyone?
Andrei.