[R] A weird observation from using read.table
Charles C. Berry
cberry at tajo.ucsd.edu
Thu Sep 27 20:53:34 CEST 2007
On Thu, 27 Sep 2007, Jun Ding wrote:
> Hi Everyone,
>
> Recently I got puzzled by the function read.table,
> even though I have used it for a long time.
>
> I have such a file (tmp.txt, 2 rows and 3 columns,
> with a space among columns):
>
> 1 2'-PDE 4
> 2 3'-PDE 5
>
> if I do:
> a = read.table("tmp.txt", header = F, quote = "")
> a
> V1 V2 V3
> 1 1 2'-PDE 4
> 2 2 3'-PDE 5
>
> Everything is fine.
>
> However, if I do:
> a = read.table("tmp.txt", header = F)
> a
> V1 V2 V3
> 1 2 3'-PDE 5
> 2 1 2'-PDE 4
> 3 2 3'-PDE 5
>
> I know it is related to the "quote" as the default
> includes '. But how can it get one more row in the
> file? Thank you very much for your help in advance!
read.table does a lot of work trying to figure out what kind of data it
will see, and it runs preliminary checks before swallowing the whole
file. It reads the first 5 lines of data through a file() connection (if
there are that many) and then tries to pushBack() two copies of those
lines. Then it rereads half of these and skips the extra header row if
there is one. At that point, it should be positioned to read all of the
data that was in the original file.
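
To make the peek-and-push-back idea concrete, here is a toy sketch using
only base R connection functions. It is not read.table's actual code; the
connection contents and the variable names (con, head_lines, n_pushed,
all_lines) are made up for illustration.

```r
# Toy illustration only: this is NOT what read.table does internally.
con <- textConnection(c("row1", "row2", "row3"))

head_lines <- readLines(con, n = 2)  # peek at the first two lines
pushBack(head_lines, con)            # put them back on the connection
n_pushed <- pushBackLength(con)      # lines currently waiting to be re-read

all_lines <- readLines(con)          # pushed-back lines come first, then the rest
close(con)

print(n_pushed)
print(all_lines)
```

If the number of lines pushed back and the number re-read get out of step,
some lines will be seen twice or not at all, which is the kind of failure
described next.
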
Declaring a quote that should not be a quote really messes this up. I
think this happens because the internal function readTableHead will ignore
newlines that are between quotes. In your example all of the data is read
by readTableHead as one line because of a quote on the first line, and
this has downstream consequences that result in not repositioning the
connection at the right place. And that leads to reading two copies of the
second line in your example.
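
A small way to check this yourself, assuming a temporary copy of the
two-line file from the question (tmp and the result names are
hypothetical). count.fields() applies the same quoting rules, so comparing
its result with and without quote = "" shows how the lines are being
split. What the default-quote calls return may differ between R versions,
so they are only printed here, wrapped in try().

```r
# Recreate the two-line file from the question in a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2'-PDE 4", "2 3'-PDE 5"), tmp)

# With quoting disabled, the apostrophes are ordinary characters.
ok <- read.table(tmp, header = FALSE, quote = "", stringsAsFactors = FALSE)
print(ok)

# Fields per line when no character is treated as a quote.
fields_noquote <- count.fields(tmp, quote = "")
print(fields_noquote)

# With the default quote set, ' opens a quoted string on the first line.
# The outcome is version dependent, so just display whatever comes back.
fields_default <- try(count.fields(tmp), silent = TRUE)
print(fields_default)
res_default <- try(read.table(tmp, header = FALSE), silent = TRUE)
print(res_default)

unlink(tmp)
```
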
If you want more details, use debug(read.table) and then run your
examples. Print 'lines', 'nlines', and 'pushBackLength(file)' at various
points in the execution of read.table and you can see what is happening.
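
A rough outline of such a session, assuming tmp.txt is in the working
directory. The stepping itself only makes sense interactively, so that
part is guarded by interactive(); the object names to inspect are the
ones mentioned above and may differ in other versions of R.

```r
# Flag read.table so that the next call enters the browser.
debug(read.table)
flagged <- isdebugged(read.table)

if (interactive()) {
  # At the Browse[n]> prompt, type n to step one expression at a time.
  # At various points, type for example:
  #   lines
  #   nlines
  #   pushBackLength(file)
  a <- read.table("tmp.txt", header = FALSE)
}

# Remove the flag again so later calls run normally.
undebug(read.table)
```
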
HTH,
Chuck
>
> Jun
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901