[R] sscanf equivalent
Paul Roebuck
roebuck at wotan.mdacc.tmc.edu
Sun Oct 9 09:36:38 CEST 2005
On Fri, 7 Oct 2005, Prof Brian Ripley wrote:
> On Fri, 7 Oct 2005, Paul Roebuck wrote:
>
> > I have a data file from which I need to read portions of
> > data but data location/quantity can change from file to file.
> > I wrote some code and have a working solution but it seems
> > wasteful to have to do it this way. Here's the contrived
> > incomplete code.
> >
> > datalines <- readLines(datafile.pathname)
> > # marker will appear on line preceding and following
> > # actual data
> > offset.data <- grep("marker", datalines)
> > datalines <- NULL
> >
> > # grab first column of each assoc dataline
> > data <- scan(datafile.pathname,
> >              what = numeric(0),
> >              skip = offset.data[1],
> >              nlines = offset.data[2]-offset.data[1]-1,
> >              flush = TRUE,
> >              multi.line = FALSE,
> >              quiet = TRUE)
> > # output is vector of values
> >
> > Originally wrote code to parse data from 'datalines'
> > using sub and strsplit methods but it was woefully slower
> > and more complex than using scan method. What is desired
> > is a means of invoking method like scan but with existing
> > data instead of filename.
>
> Why not use a text connection?
I tried that, but the result was far slower than the method above.
R> file.info(datafile.pathname)$size
[1] 944850
R> system.time(datalines<-readLines(datafile.pathname), TRUE)[3]
[1] 0.59
R> length(datalines)
[1] 67931
R> system.time(tconn<-textConnection(datalines), TRUE)[3]
[1] 52.97
Once the textConnection object was created, the scan invocation
using it took less than half the time of the corresponding
filename-based invocation. The problem is that the filename-based
scan was only taking a second to begin with, so that saving is
dwarfed by the 52.97 seconds spent creating the connection. And
since grep doesn't accept a textConnection as an argument, I still
need the otherwise unused 'datalines' variable and its associated
memory. Even if grep did support one, creating the connection
without the intermediate variable took even longer:
R> system.time(tconn<-textConnection(readLines(datafile.pathname)), TRUE)[3]
[1] 66.61
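One variation not timed above would be to open the text connection on
only the lines between the two markers instead of on all of 'datalines',
since the connection would then cover far fewer lines. What follows is
a minimal sketch under stated assumptions, not a measured result: the
sample file contents are made up, the names 'tmpfile' and 'block' are
hypothetical, and it assumes exactly two marker lines with at least one
data line between them.

```r
# Hypothetical sample file; the contents are made up for illustration.
tmpfile <- tempfile()
writeLines(c("header line",
             "marker",
             "1.5 foo bar",
             "2.5 baz qux",
             "3.5 more text",
             "marker",
             "trailer line"),
           tmpfile)

datalines <- readLines(tmpfile)
offset.data <- grep("marker", datalines)

# Keep only the lines strictly between the two markers.
# Assumes at least one data line; otherwise this index range runs backwards.
block <- datalines[(offset.data[1] + 1):(offset.data[2] - 1)]

# Open the text connection on the small subset, not on all of datalines.
tconn <- textConnection(block)
data <- scan(tconn,
             what = numeric(0),
             flush = TRUE,        # read first field, discard rest of each line
             multi.line = FALSE,
             quiet = TRUE)
close(tconn)
unlink(tmpfile)

print(data)
```

This still keeps 'datalines' around for the grep call, so it does not
address the memory concern; it only shrinks what the connection has to
hold.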
Any other thoughts?
# R version 2.1.1, 2005-06-20, powerpc-apple-darwin7.9.0
----------------------------------------------------------
SIGSIG -- signature too long (core dumped)