[R] read in large data file (tsv) with inline filter?

Gabor Grothendieck ggrothendieck at gmail.com
Tue Mar 24 12:45:24 CET 2009


On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. If it were not so large, I could read the entire file
> in and "grep" the lines I need. For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, are very slow. I am aware of possibilities via SQLite; I would
> prefer not to use that in this case.
>
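
For reference, the skip/nrows approach you describe would look roughly
like the sketch below (the file name, chunk size and pattern are just
placeholders).  Each call to read.delim() has to re-read the file from
the top to honour 'skip', so the total work grows rapidly with the size
of the file, which is why it is slow.

# rough sketch of the repeated-scan approach (placeholders throughout)
chunk <- 10000
skip  <- 0
hits  <- list()
repeat {
  piece <- tryCatch(read.delim("bigfile.tsv", header = FALSE,
                               skip = skip, nrows = chunk),
                    error = function(e) NULL)   # stop at end of file
  if (is.null(piece) || nrow(piece) == 0) break
  keep <- grep("somepattern", piece[[1]])       # filter on first column
  if (length(keep) > 0) hits[[length(hits) + 1]] <- piece[keep, ]
  skip <- skip + chunk
}
found <- do.call(rbind, hits)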

Why the restriction?  Using sqldf with SQLite it's only
two R statements.

The first statement defines the file and the second statement
defines what you want to extract from it and its format.  First
create a sample file using the built-in data frame BOD, and let's
specify tab delimiters.  A header, i.e. col.names = TRUE, is the
default for write.table:

# create tab delimited test file with headers
write.table(BOD, file = "myfile.dat", sep = "\t")

# now read in only those records satisfying a condition
library(sqldf)

# here are the two statements
myfile <- file("myfile.dat")
DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
  file.format = list(sep = "\t"))
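
For comparison, the same condition applied to the in-memory copy of
BOD picks out the rows the query is meant to return:

# equivalent filter in plain R, just as a check
subset(BOD, demand < 10 | demand > 15)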

See the examples on the home page, http://sqldf.googlecode.com, and
see ?sqldf for info on the file.format argument in case your format differs.
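
For instance, if the file were comma separated and had no header line,
something along these lines should work (this assumes, per ?sqldf, that
the file.format list accepts header and sep components; the file name
here is made up):

# hypothetical comma-separated file without a header row
myfile2 <- file("myfile2.dat")
DF2 <- sqldf("select * from myfile2",
  file.format = list(header = FALSE, sep = ","))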



