[R] count.fields inconsistent with read.table?
Sam Steingold
sds at gnu.org
Fri Feb 24 16:51:39 CET 2012
> * peter dalgaard <cqnytq at tznvy.pbz> [2012-02-24 08:41:07 +0100]:
> On Feb 24, 2012, at 06:58 , Sam Steingold wrote:
>
>> batch is a vector of lines returned by readLines from a
>> NL-line-terminated file, here is the relevant section:
>> =========================================================
>> AA BB CC DD EE FF
>> GG H
>>
>> H JJ KK LL MM
>> =========================================================
>> as you can see, a line is corrupt; two CRLF's are inserted.
>
> Actually, I don't see... (It's pretty hard to count TAB characters by eye.)
how about this?
>> =========================================================
>> AA^IBB^ICC^IDD^I^I^IEE^IFF
>> GG^IH^M
>> ^M
>> H^IJJ^IKK^I^I^ILL^IMM
>> =========================================================
I replaced TAB with ^I and CR with ^M.
is this better?
here I use <TAB> and <CR> instead:
>> =========================================================
>> AA<TAB>BB<TAB>CC<TAB>DD<TAB><TAB><TAB>EE<TAB>FF
>> GG<TAB>H<CR>
>> <CR>
>> H<TAB>JJ<TAB>KK<TAB><TAB><TAB>LL<TAB>MM
>> =========================================================
so, you see, there are two data records here:
A..F is good, with 8 fields.
G..M is BAD: two CRLFs were inserted inside the 2nd field, turning one
line into 3 lines.
so I must drop those 3 lines from the input.
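For reference, a rendering like the one above can be produced with a couple of gsub() calls. This is only a sketch; the helper name show.ctrl is made up for illustration and is not part of my actual script.

```r
# Render TAB and CR as visible markers so fields can be counted by eye.
# show.ctrl is a hypothetical helper name, not an existing R function.
show.ctrl <- function(x) {
  x <- gsub("\t", "<TAB>", x, fixed = TRUE)
  gsub("\r", "<CR>", x, fixed = TRUE)
}

show.ctrl("GG\tH\r")   # "GG<TAB>H<CR>"
```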
>> This is okay, I drop the bad lines, at least I hope I do:
>>
>> conn <- textConnection(batch)
>> field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
>> close(conn)
>> good <- field.counts == 8 # this should drop all bad lines
>> if (!all(good))
>> batch <- batch[good]
>> conn <- textConnection(batch)
>> ret <- read.table(conn, sep="\t", comment.char="", quote="")
>> close(conn)
>>
>> I get this error in read.table():
>>
>> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>> line 7151 did not have 8 elements
>>
>> how come?!
>
> You can do better than this in terms of providing clues for us:
> "batch" is a character vector, right? So recheck that count.fields
> returns all 8's after removal of bad lines. Also check that dimensions
> match -- is length(batch) actually the same as length(field.counts)?
batch <- lines[807000:808000]
conn <- textConnection(batch)
field.counts <- count.fields(conn, sep="\t", comment.char="", quote="")
close(conn)
good <- field.counts == length(col.names)
which(!good)
[1] 152 153
## WRONG: it should be 3 lines, 154 is also bad - see above
batch[!good]
[1] "GG\tH" ""
length(batch)
[1] 1001
length(good)
[1] 1000
## WRONG: batch, field.counts and good should have the same length
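AHA! blank.lines.skip !!!
count.fields() defaults to blank.lines.skip=TRUE, so the empty line gets
no entry and field.counts ends up one element shorter than batch; from
that point on the logical index is shifted by one.
I must set it to FALSE!!!
and it does fix the problem...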
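The length mismatch can be reproduced on the four lines shown earlier. This is a minimal sketch, assuming the corrupt record arrives from readLines as three elements with an empty string in the middle; the names count.lines, counts.default and counts.aligned are made up for illustration.

```r
# Toy batch: one good 8-field line, then one record broken into 3 lines.
batch <- c("AA\tBB\tCC\tDD\t\t\tEE\tFF",
           "GG\tH",
           "",
           "H\tJJ\tKK\t\t\tLL\tMM")

# Hypothetical helper: count TAB-separated fields per line of a character vector.
count.lines <- function(txt, ...) {
  conn <- textConnection(txt)
  on.exit(close(conn))
  count.fields(conn, sep = "\t", comment.char = "", quote = "", ...)
}

# Default (blank.lines.skip = TRUE): the empty line gets no entry at all.
counts.default <- count.lines(batch)
# blank.lines.skip = FALSE: one entry per input line.
counts.aligned <- count.lines(batch, blank.lines.skip = FALSE)

length(counts.default) == length(batch)   # FALSE
length(counts.aligned) == length(batch)   # TRUE

# With aligned counts, the logical index drops all three bad lines.
good <- counts.aligned == 8
conn <- textConnection(batch[good])
ret <- read.table(conn, sep = "\t", comment.char = "", quote = "")
close(conn)
```

With the aligned counts, only the first line survives the filter and read.table() no longer complains.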
> Finally, what is in line 7151?
that's the first line with a <CR>:
GG<TAB>H<CR>
>> also, is there some error recovery?
>
> Well you can try().
it appears that try gives me access to the error message, not the
erroneous data, i.e., I still have to reload the file to get the batch
string vector.
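One possible way around the reload, as a sketch only: wrap the read in tryCatch() inside a function, so the handler still sees the lines that were being parsed. The wrapper name safe.read and the shape of its return value are my own assumptions, not anything suggested in this thread.

```r
# Hypothetical wrapper: on failure, return the error message together with
# the lines that were being parsed, so nothing has to be re-read from disk.
safe.read <- function(txt) {
  conn <- textConnection(txt)
  on.exit(close(conn))
  tryCatch(
    list(ok = TRUE,
         data = read.table(conn, sep = "\t", comment.char = "", quote = "")),
    error = function(e)
      list(ok = FALSE, message = conditionMessage(e), lines = txt)
  )
}

# Same toy batch as in the example data: the broken record makes read.table fail.
bad <- c("AA\tBB\tCC\tDD\t\t\tEE\tFF", "GG\tH", "", "H\tJJ\tKK\t\t\tLL\tMM")
res <- safe.read(bad)
if (!res$ok) print(res$lines)   # the offending batch is still at hand
```

The handler is a closure, so the input vector stays reachable when the error fires; whether that is enough depends on how the batches are produced in the first place.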
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.childpsy.net/ http://www.memritv.org http://americancensorship.org
http://memri.org http://jihadwatch.org http://dhimmi.com http://iris.org.il
Democracy is like a car: you can ride it or you can run people over with it.