[R] using sqldf's read.csv.sql to read a file with "NA" for missing

Thu Sep 6 00:19:46 CEST 2012

On Wed, Sep 5, 2012 at 5:31 PM, David Reiner <David.Reiner at xrtrading.com> wrote:
> I'm trying to use sqdf's function read.csv.sql to read CSV files in which the missing values are represented by NA's.
> Plain old read.csv works fine on these files, but they are rather large and I'd like to filter using sql-like statements.
> However, even if I specify field.types correctly and nrows=-1, it still turns the columns with NA's into chars or 0.
> I'm trying to make this OS independent, so I don't think I can use a filter to convert the NA's to NULL's or whatever SQLite would understand.
> I can accept it everything has to be read in as char and then convert to doubles with as.numeric, but I'm looking for speed.
>
> Here is code I thought would read the file (I've attached a small sample.)
> It almost works if there are no NA's in the initial rows, but it still turns NA's into 0's instead of NA or something I can change into NA;
> and it returns characters if there are NA's in the initial rows.
> (0 is a possible value so I can't filter out the 0's.)
>
> field.types <- list(V1='char', V2='char', V3='real', V4='int', V5='real', V6='int', V7='real')
> dtst <- read.csv.sql("./tmp.csv", header=FALSE, field.types=field.types, nrows=-1)
> str(dtst)
>
> 'data.frame':   32 obs. of  7 variables:
>  $ V1: chr  "2012-07-01" "2012-07-01" "2012-07-01" "2012-07-01" ...
>  $ V2: chr  "15:50:00" "15:51:00" "15:52:00" "15:53:00" ...
>  $ V3: chr  "NA" "NA" "NA" "NA" ...
>  $ V4: int  0 0 0 0 0 0 0 0 0 0 ...
>  $ V5: chr  "NA" "NA" "NA" "NA" ...
>  $ V6: int  0 0 0 0 0 0 0 0 0 0 ...
>  $ V7: chr  "NA" "NA" "NA" "NA" ...
>

See FAQ#14 on the sqldf home page noting the part at the end of the
answer about csvfix.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com