[R] R/S and large datasets - Database access (also Re: SAS and S/R)

Timothy H. Keitt tklistaddr at keittlab.bio.sunysb.edu
Wed Nov 28 19:27:36 CET 2001

Emmanuel Charpentier wrote:

> A consensus seems to be emerging: R would excel at exploratory work on 
> small/middle-sized datasets, while SAS would be able to munch much 
> larger datasets.
> However, I see the "size" problem as a red herring. The objects that 
> have to stay "in core" are usually much smaller than the dataset. For 
> example, for problems involving fixed-effects linear models, you need 
> only some matrices whose size is proportional to the square of the 
> number of *variables* and the (admittedly large) vector of residuals 
> (whose size is equal to the number of observations). Other cases 
> (nonlinear mixed effects models come to mind) are not as easily tamed 
> (any iterative process (such as ML estimation) has to go back to the 
> original data), but at least the time penalty involved in the use of 
> such an interface pays off by allowing you to treat problems that 
> would otherwise be intractable.
> I am aware of at least one database access package that lets you 
> access data without dragging a whole table into memory: the RPgSQL 
> package offers what it calls a "proxy variable", which is an object 
> that behaves, for all practical purposes, as a data frame, but is an 
> interface to database tables. I see this kind of interface as a way to 
> avoid overloading core memory with data scarcely used.
> Unfortunately, the said package is now officially orphaned by its 
> developer, who states that he now focuses on the next database 
> access standard: the Rdbi interface, which is currently under 
> development, and which I don't know a thing about.
> So the question is: does the Rdbi interface offer such a proxy to data 
> still residing in databases?
> Or am I barking up the wrong tree and trying to (re-)invent an 
> oversophisticated virtual memory manager? Should the use of a 
> sufficiently large swapfile be enough for these "large dataset" problems?
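
To make the quoted point about in-core size concrete, here is a minimal
sketch of fitting a fixed-effects linear model while keeping only the
p-by-p cross-product matrix and a p-vector in memory. It is written in
Python only because its standard library ships an SQL engine (sqlite3).
The table "obs", its columns, and the helper names are hypothetical.
It uses the plain normal equations, which are less numerically stable
than the QR approach 'lm' uses, so treat it as an illustration rather
than a recipe.

```python
import sqlite3

def accumulate_normal_equations(cursor, n_predictors, chunk_size=1000):
    """Stream rows (x1, ..., xp, y) and accumulate X'X and X'y.

    An intercept column of ones is prepended to each row. Only one
    chunk of rows is held in memory at a time.
    """
    p = n_predictors + 1  # +1 for the intercept
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            x = (1.0,) + tuple(float(v) for v in row[:-1])
            y = float(row[-1])
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return xtx, xty

def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

# Hypothetical data: y = 2 + 3*x exactly, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(i), 2.0 + 3.0 * i) for i in range(1000)],
)

cur = conn.execute("SELECT x, y FROM obs")
xtx, xty = accumulate_normal_equations(cur, n_predictors=1, chunk_size=100)
beta = solve(xtx, xty)  # [intercept, slope]
```

The residual vector is deliberately not stored here; it could be
recomputed in a second pass over the cursor if it is needed.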

The problem with proxy data frames is that you can't pass them to 
functions like 'lm' (at least when I tried it long ago), because the 
functions that make the proxy object look like a data frame only exist 
at the R level. When you drop down to internal C code, you call a 
different set of (non-overloadable) functions, so it just appears as a 
scalar object. Duncan's news about the generic "attach" interface may 
soon make this possible, however. Actually, having learned some SQL, I 
now find it indispensable. As you say, generally you only work with a 
small subset of your data, and SQL queries are the best way I've found 
to do the subsetting.
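
As a rough illustration of subsetting in the database rather than in
memory, the sketch below lets an SQL query pick the rows and columns, so
that only the result crosses into the analysis session. It uses Python's
standard sqlite3 module purely for self-containment; the table "plots"
and its columns are made up for the example.

```python
import sqlite3

# Hypothetical table standing in for a large database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plots (site TEXT, year INTEGER, biomass REAL)")
conn.executemany(
    "INSERT INTO plots VALUES (?, ?, ?)",
    [
        ("A", 1999, 10.5),
        ("A", 2000, 12.0),
        ("B", 1999, 7.25),
        ("B", 2000, 8.0),
        ("C", 2000, 15.0),
    ],
)

# The WHERE clause does the subsetting inside the database; only the
# matching rows and the two selected columns are fetched.
subset = conn.execute(
    "SELECT site, biomass FROM plots "
    "WHERE year = ? AND biomass > ? ORDER BY site",
    (2000, 10.0),
).fetchall()
```

The same query text would work against any SQL backend; only the
connection call is specific to sqlite3.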

Also, there has been some recent discussion of a proposed generic DBI 
interface for R/S. Rdbi was my attempt (actually what I originally set 
out to do with RPgSQL, but some necessary internal functions were not 
yet documented or in some cases not yet implemented). We more-or-less 
settled on David James' proposal, but I do not know if anyone is 
actually implementing it. It would be nice to have a reference 
implementation so we can try it out and see what we do or don't like. I 
hope to see all of this resolved soon as I have less and less time to 
put into it and my interests are moving elsewhere (e.g., more GIS 


Timothy H. Keitt
Department of Ecology and Evolution
State University of New York at Stony Brook
Stony Brook, New York 11794 USA
Phone: 631-632-1101, FAX: 631-632-7626

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
