[R] good and bad ways to import fixed column data (rpy)
Ross Boylan
ross at biostat.ucsf.edu
Mon Aug 17 06:16:29 CEST 2009
Just to quote explicitly the passage I mentioned in the R Data document:
<QUOTE>
Function `read.fwf' provides a simple way to read such files,
specifying a vector of field widths. The function reads the file into
memory as whole lines, splits the resulting character strings, writes
out a temporary tab-separated file and then calls `read.table'. This
is adequate for small files, but for anything more complicated we
recommend using the facilities of a language like `perl' to pre-process
the file.
</QUOTE>
Note particularly the final sentence.
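
For anyone who would rather stay inside R, the splitting step can also be done directly on the lines, without the temporary file that read.fwf writes. What follows is only a minimal sketch under my own assumptions: the layout (city in columns 1-22, population in columns 23-32) is borrowed from the example quoted below, and the two sample records are invented so the code runs without an input file. I have not timed it against the real files.

```r
# Hypothetical layout: city in columns 1-22, population in columns 23-32.
widths <- c(city = 22, population = 10)
ends   <- cumsum(widths)
starts <- ends - widths + 1

# Stand-in for readLines("fixed.txt"); these two records are invented.
lines <- c(sprintf("%-22s%10d", "Springfield", 150000L),
           sprintf("%-22s%10d", "Shelbyville", 65000L))

# Cut every line at the same character positions, one field at a time,
# and strip the padding blanks from each field.
fields <- lapply(seq_along(widths), function(i) {
  gsub("^ +| +$", "", substring(lines, starts[i], ends[i]))
})
names(fields) <- names(widths)

# Assemble the data frame once, converting the numeric column explicitly.
test <- data.frame(city = fields$city,
                   population = as.numeric(fields$population),
                   stringsAsFactors = FALSE)
```

Because substring is vectorised over the lines, the data frame is built in a single step rather than row by row.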
Ross
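
P.S. On the rbind suspicion in my original message (quoted below): growing a data frame one row at a time copies everything accumulated so far on every call, so the total work grows roughly with the square of the number of rows. A rough sketch of the difference, with invented rows standing in for the parsed lines; the variable names are mine and I have not benchmarked this on the real data:

```r
# Invented rows standing in for parsed input lines.
n <- 200
rows <- lapply(seq_len(n), function(i) data.frame(var1 = i, var2 = i^2))

# What I was doing, in effect: grow the data frame one row at a time.
# Each rbind copies everything accumulated so far.
grown <- rows[[1]]
for (i in 2:n) grown <- rbind(grown, rows[[i]])

# Alternative: keep the pieces in a list and bind them in a single call.
bound <- do.call(rbind, rows)
```

Both give the same 200 x 2 result; the second form avoids the repeated copying, though it does nothing about the cost of crossing the python/R boundary for every line.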
On Sun, 2009-08-16 at 19:37 -0400, Wensui Liu wrote:
> Gabor made a good point.
> Here is an example I copied from my blog.
>
> ##############################################
> # READ FIXED-WIDTH DATA FILE WITH read.fwf() #
> # ------------------------------------------ #
> # EQUIVALENT SAS CODE: #
> # filename data 'E:\sas\fixed.txt'; #
> # data test; #
> # infile data truncover; #
> # input @1 city $ 1 - 22 @23 population; #
> # run; #
> ##############################################
>
> # OPEN A CONNECTION TO THE DATA FILE
> data <- file(description = "e:\\sas\\fixed.txt", open = "r")
>
> # widths = c(...) ==> SPECIFIES COLUMN WIDTHS
> # col.names = c(...) ==> GIVES COLUMN NAMES
> # colClasses = c(...) ==> DEFINES COLUMN CLASSES
> test <- read.fwf(data, header = FALSE, widths = c(22, 10),
>                  col.names = c("city", "population"),
>                  colClasses = c("character", "numeric"))
>
> close(data)
>
> On Sun, Aug 16, 2009 at 6:36 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > Check out ?read.fwf
> >
> > On Sun, Aug 16, 2009 at 4:49 PM, Ross Boylan <ross at biostat.ucsf.edu> wrote:
> >> Recorded here so others may avoid my mistakes.
> >>
> >> I have a bunch of files containing fixed-width data. The R Data guide
> >> suggests that one pre-process them with a script if they are large.
> >> They were 50 MB and up, and I needed to process another file that gave
> >> the layout of the lines anyway.
> >>
> >> I tried rpy to not only preprocess but also create the R data object in
> >> one go. It seemed like a good idea; it wasn't. The core operation was to
> >> build up a string for each line that looked like
> >> "data.frame(var1=val1, var2=val2, [etc])" and then rbind this to the
> >> data.frame so far. I did this with r(mycommand string). Almost all the
> >> values were numeric.
> >>
> >> This was incredibly slow, being unable to complete after running
> >> overnight.
> >>
> >> So, the lesson is, don't do that!
> >>
> >> I switched to preprocessing that created a csv file, and then read.csv
> >> from R. This worked in under a minute. The result had dimension
> >> 150913 x 129.
> >>
> >> The good news in rpy was that I found objects persisted across calls to
> >> the r object.
> >>
> >> Exactly why this was so slow I don't know. The two obvious suspects are
> >> the speed of rbind, which I think is pretty inefficient, and the
> >> overhead of crossing the python/R boundary.
> >>
> >> This was on Debian Lenny:
> >> python-rpy 1.0.3-2
> >> Python 2.5.2
> >> R 2.7.1
> >>
> >> rpy2 is not available in Lenny, though it is in development versions of
> >> Debian.
> >>
> >> Ross Boylan
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>
>
>
> --
> ==============================
> WenSui Liu
> Blog : statcompute.spaces.live.com
> Tough Times Never Last. But Tough People Do. - Robert Schuller
> ==============================
>
>