[R] sqldf for Very Large Tab Delimited Files
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Feb 2 18:57:15 CET 2012
On Thu, Feb 2, 2012 at 3:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Wed, Feb 1, 2012 at 11:57 PM, HC <hcatbr at yahoo.co.in> wrote:
>> Hi All,
>>
>> I have a very (very) large tab-delimited text file without headers. There
>> are only 8 columns and millions of rows. I want to make numerous pieces of
>> this file by sub-setting it for individual stations. Station is given as in
>> the first column. I am trying to learn and use sqldf package for this but am
>> stuck in a couple of places.
>>
>> To simulate my requirement, I have taken iris dataset as an example and have
>> done the following:
>> (1) create a tab-delimited file without headers.
>> (2) read it using read.csv.sql command
>> (3) write the result of a query, getting first 10 records
>>
>> Here is the reproducible code that I am trying:
>> # Text data file
>> write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
>> col.names=FALSE,row.names = FALSE)
>> # create an empty database (can skip this step if database already exists)
>> sqldf("attach myTestdbT as new")
>> f1<-file("irisNoH.txt")
>> attr(f1, "file.format") <- list(header=FALSE,sep="\t")
>> # read into table called irisTab in the mytestdb sqlite database
>> read.csv.sql("irisNoH.txt", sql = "create table main.irisTab1 as select *
>> from file", dbname = "mytestdb")
>> res1<-sqldf("select * from main.irisTab1 limit 10", dbname = "mytestdb")
>> write.table(res1, "iris10.txt", sep = "\t", quote = FALSE,
>> col.names=FALSE,row.names = FALSE)
>>
>> # For querying records of a particular species - unresolved problems
>> #a1<-"virginica"
>> #attr(f1, "names") <- c("A1","A2","A3","A4","A5")
>> #res2<-fn$sqldf("select * from main.irisTab1 where A5 = '$a1'")
>>
>> In the above, I am not able to:
>> (1) assign the names to various columns
>> (2) query for particular value of a column; in this case for particular
>> species, say virginica
>> (3) I guess fn$sqldf can do the job but it requires assigning column names
>>
>> Any help would be most appreciated.
>>
>
> Ignoring your iris file for a moment, to query the 5th column (getting
> its name via sql rather than via R) we can do this:
>
> library(sqldf)
> species <- "virginica"
> nms <- names(dbGetQuery(con, "select * from iris limit 0"))
> fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")
>
> Now, sqldf is best used when you are getting the data from R but if
> you want to store it in a database and just leave it there then you
> might be better off using RSQLite directly like this (the eol = "\r\n"
> in the dbWriteTable statement was needed on my Windows system but you
> may not need that depending on your platform):
>
>
> write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE, col.names
> = FALSE, row.names = FALSE)
>
> library(sqldf)
> library(RSQLite)
>
> con <- dbConnect(SQLite(), dbname = "mytestdb")
>
> dbWriteTable(con, "iris", "irisNoH.txt", sep = "\t", eol = "\r\n")
>
> species <- "virginica"
> nms <- names(dbGetQuery(con, "select * from iris limit 0"))
> fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")
>
> dbDisconnect(con)
There seems to have been a pasting error here. The first part was
intended to show how to do this using sqldf and the second using
RSQLite. Thus the first part was intended to be:
library(sqldf)
species <- "virginica"
# obviously we could just do nms <- names(iris) but to get
# names from database instead
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
# use 5th column
fn$sqldf("select * from iris where `nms[5]` = '$species' limit 3")
and the second part that illustrates RSQLite was ok. Note that fn$
comes from the gsubfn package which sqldf loads.
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the R-help
mailing list