[R] R/S and large datasets - Database access (also Re: SAS and S/R)

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Tue Nov 27 19:28:06 CET 2001

David James wrote:

>The Rdbi (or perhaps simply DBI, for database interface, since it is
>meant to include both R and Splus) is a simple interface to any database
>management system or DBMS (so far only *relational* databases have been
>considered) very similar in spirit to Java's Database Connectivity (JDBC),
>Perl's Database Independent (DBI), Python's Database API.  It deals
>primarily with a common set of function to interface R and Splus to
>databases (PostgreSQL, Oracle, Access, MySQL, mSQL, etc.)  But we should
>think of this DBI only as a first step, or the infrastructure on which
>we can build more sophisticated tools.  The proxy table/variable is a
>good example of such a tool.  But if it's good for PostgreSQL tables,
>why not for Microsoft SQL tables? Or MySQL tables?  By having a common
>interface, we hope to be able to build this sort of advanced tools
>independent of the underlying DBMS.
That should make ODBC your first target: more than half of the work is
already done by that interface.
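
To illustrate the common-interface idea in the quoted paragraph, here is
a minimal sketch in Python, whose Database API is one of the models
mentioned above. It assumes the standard-library sqlite3 module as a
stand-in DBMS; the helper column_mean and the table obs are made up for
the example. The helper touches only cursor(), execute() and fetchone(),
so in principle a connection from any DB-API driver could be passed in,
though that is untested here.

```python
import sqlite3


def column_mean(conn, table, column):
    """Mean of a numeric column, using only DB-API 2.0 calls.

    Hypothetical helper: nothing here is specific to one DBMS.
    """
    # Identifiers cannot be bound as query parameters, so validate them.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    cur = conn.cursor()
    cur.execute(f"SELECT AVG({column}) FROM {table}")
    (mean,) = cur.fetchone()
    cur.close()
    return mean


# Demo setup (assumption: an in-memory SQLite database stands in for
# PostgreSQL, Oracle, MySQL, ...). conn.execute() is a sqlite3 shortcut.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, x REAL)")
conn.executemany("INSERT INTO obs (x) VALUES (?)", [(1.0,), (2.0,), (6.0,)])
conn.commit()

print(column_mean(conn, "obs", "x"))
```

Only the connection setup names a particular DBMS; the helper itself
does not, which is the property a common interface is meant to give.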

>Other applications may include the ability to attach() any database
>to the search() path (together with the idea of proxy objects,
>it could be helpful in some cases);  also, the possibility to do
>"database apply" where we apply R functions to chunks on remote
>tables.  (Roger Koenker and his colleague have an LM example, see
>http://www.econ.uiuc.edu/~roger/research/rq/LM.html).  There has also
>been some interest of approximating quantiles, applying GLM's, etc., to
>very large datasets, but techniques like these will most likely require
>new algorithms to work sequentially.
That seems to have been on the minds of the developers of a large part
of R already. As far as I can tell, at least ...
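
As a rough illustration of the "database apply" idea and of an
algorithm that works sequentially, the sketch below fits a simple
linear regression y = a + b*x while reading the table one chunk at a
time, keeping only running sums in memory. It is not the LM example
cited above: the function chunked_simple_lm, the table xy and the
chunk size are assumptions made up for this example, and Python's
standard sqlite3 module stands in for a remote DBMS.

```python
import sqlite3


def chunked_simple_lm(conn, query, chunk_size=1000):
    """Fit y = a + b*x from a query returning (x, y) rows.

    Rows are fetched chunk by chunk; only five running sums are kept,
    so memory use does not grow with the size of the table.
    """
    n = sx = sy = sxx = sxy = 0.0
    cur = conn.cursor()
    cur.execute(query)
    while True:
        rows = cur.fetchmany(chunk_size)  # one chunk of the result set
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    cur.close()
    # Closed-form least-squares estimates from the sufficient statistics.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Demo data (hypothetical): y = 2 + 3x exactly, for x = 0..9.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xy (x REAL, y REAL)")
conn.executemany("INSERT INTO xy VALUES (?, ?)",
                 [(float(i), 2.0 + 3.0 * i) for i in range(10)])
conn.commit()

intercept, slope = chunked_simple_lm(conn, "SELECT x, y FROM xy", chunk_size=4)
print(intercept, slope)
```

Plain running sums like these can lose precision on badly scaled data;
a production version would probably want a more stable update. Methods
such as quantiles or GLMs do not reduce to fixed-size sums this way,
which is why they call for new algorithms.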

>And of course, some also have pointed out (Brian Ripley, among others)
>that sampling has been used quite successfully before by statisticians:-)
>and thus could be quite useful in some of these cases.
That, IMHO, is aimed at a totally different set of problems. Sampling
part of a dataset to build a model, then validating that model on the
rest of the dataset, is not specific to very large datasets.

>  I'm not aware
>of any tools available yet to do this on remote DBMSes, but one would
>hope that if such a tool were to be developed, it would be done on top
>of the DBI so that it could be used with any DBMS.
The easiest way is to select the subset through SQL queries, perhaps
creating a small auxiliary table that records the subsetting for
reference purposes. This does not require many tools, just a working
knowledge of SQL and of the database structure. On large sites with
DBAs, the latter isn't even necessary: just ask them for a view
suiting your needs and the ability to create your subset index tables...
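
A minimal sketch of that approach, with made-up names throughout (the
patients table, its columns and the subset_idx table are assumptions
for the example), using Python's standard sqlite3 module as a stand-in
DBMS. The auxiliary table stores only the keys of the selected rows,
so both the subset and its complement can be recovered later with
plain SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical data table standing in for a large remote table.
cur.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, score REAL)"
)
cur.executemany(
    "INSERT INTO patients (id, age, score) VALUES (?, ?, ?)",
    [(i, 20 + i, i * 1.5) for i in range(1, 11)],
)

# Small auxiliary table recording which rows belong to the subset.
cur.execute(
    "CREATE TABLE subset_idx (id INTEGER PRIMARY KEY REFERENCES patients(id))"
)
# The selection rule is arbitrary here; a random draw could replace it.
cur.execute("INSERT INTO subset_idx (id) SELECT id FROM patients WHERE age >= 26")
conn.commit()

# The subset itself: join the data table against the index table.
cur.execute(
    "SELECT p.id, p.age, p.score FROM patients p "
    "JOIN subset_idx s ON s.id = p.id ORDER BY p.id"
)
subset = cur.fetchall()

# The rest of the dataset, e.g. for validating a model built on the subset.
cur.execute(
    "SELECT id FROM patients "
    "WHERE id NOT IN (SELECT id FROM subset_idx) ORDER BY id"
)
rest = [row[0] for row in cur.fetchall()]

print(len(subset), len(rest))
```

Because the index table persists, the same subset can be referred to
again in later sessions without repeating the selection rule.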

                                        Emmanuel Charpentier

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
