[R] Speeding reading of large file

Wed Nov 28 21:14:54 CET 2012

Once you've read the data with readLines()

head(raw) gets you the first 6 lines or
head(nchar(raw)) the lengths of those lines

Once you know which row has the headers (here it is 2), just use

names(dta) <- read.table(text=raw[2], stringsAsFactors=FALSE)

to add column names to the data frame.

-------
David

> -----Original Message-----
> From: Fisher Dennis [mailto:fisher at plessthan.com]
> Sent: Wednesday, November 28, 2012 1:42 PM
> To: dcarlson at tamu.edu
> Cc: r-help at r-project.org
> Subject: Re: [R] Speeding reading of large file
> 
> An interesting approach -- I lose the column names (which I need) but I
> could get them with something cute such as:
> 	1.  read the first few lines only with readLines(FILENAME, n=10)
> 	2.  use your approach to read.table -- this will grab the column
> names
> 	3.  replace the headers in the full version with the correct
> column names
> 
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
> 
> On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
> 
> > Using your first approach, this should be faster
> >
> > raw <- readLines(con=filename)
> > dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
> >
> > ----------------------------------------------
> > David L Carlson
> > Associate Professor of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> >> project.org] On Behalf Of Fisher Dennis
> >> Sent: Wednesday, November 28, 2012 11:43 AM
> >> To: r-help at r-project.org
> >> Subject: [R] Speeding reading of large file
> >>
> >> R 2.15.1
> >> OS X and Windows
> >>
> >> Colleagues,
> >>
> >> I have a file that looks that this:
> >> TABLE NO.  1
> >> PTID        TIME        AMT         FORM        PERIOD      IPRED
> >> CWRES       EVID        CP          PRED        RES         WRES
> >>  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
> 0.0000E+00
> >> 0.0000E+00
> >>  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
> 0.0000E+00
> >> 0.0000E+00
> >>
> >> The file is reasonably large (> 10^6 lines) and the two line header
> is
> >> repeated periodically in the file.
> >> I need to read this file in as a data frame.  Note that the number
> of
> >> columns, the column headers, and the number of replicates of the
> >> headers are not known in advance.
> >>
> >> I have tried two approaches to this:
> >> 	First Approach:
> >> 		1.  readLines(FILENAME) to read in the file
> >> 		2.  use grep to find the repeat headers; strip out the
> >> repeat headers
> >> 		3.  write() the object to tempfile, read in that temporary
> >> file using read.table(tempfile, header=TRUE, skip=1) [an alternative
> is
> >> to use textConnection but that does not appear to speed things]
> >>
> >> 	Second Approach:
> >> 		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1,
> >> fill=TRUE, as.is=TRUE)
> >> 		2.  get rid of the errant entries with:
> >> 			TEMP[!is.na(as.numeric(TEMP[,1])),]
> >> 		3.  reading of the character entries forced all columns to
> >> character mode.  Therefore, I convert each column to numeric:
> >> 			for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
> >> as.numeric(TEMP[,COL])
> >> The second approach is ~ 20% faster than the first.  With the second
> >> approach, the conversion to numeric occupies 50% of the elapsed
> time.
> >>
> >> Is there some approach that would be much faster?  For example,
> would a
> >> vectorized approach to conversion to numeric improve throughput?
> Or,
> >> is there some means to ensure that all data are read as numeric (I
> >> tried to use colClasses but that triggered an error when the text
> >> string was encountered).
> >>
> >> ############################
> >> A dput version of the data is:
> >> c("TABLE NO.  1", " PTID        TIME        AMT         FORM
> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
> >> RES         WRES",
> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
> 0.0000E+00
> >> 0.0000E+00"
> >> )
> >>
> >> This can be assembled into a large dataset and written to a file
> named
> >> FILENAME with the following code:
> >> cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM
> >> PERIOD      IPRED       CWRES       EVID        CP          PRED
> >> RES         WRES",
> >> "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
> >> 3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01
> 0.0000E+00
> >> 0.0000E+00",
> >> "  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
> >> 1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01
> 0.0000E+00
> >> 0.0000E+00"
> >> )[rep(1:10, 1000)], file="FILENAME", sep="\n")
> >>
> >>
> >> Dennis
> >>
> >>
> >> Dennis Fisher MD
> >> P < (The "P Less Than" Company)
> >> Phone: 1-866-PLessThan (1-866-753-7784)
> >> Fax: 1-866-PLessThan (1-866-753-7784)
> >> www.PLessThan.com
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-
> >> guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >