[R] "read.table" and "scan" skips newlines which "count.fields" finds in Thai textfile
Andzsin
andzsinszan at gmail.com
Wed Feb 3 05:41:35 CET 2010
Hi there,
I have some problems reading in a Thai text.
Some of the newlines are skipped.
(see the contents of my file below)
R> count.fields("my.txt", sep='\n', quote="")
[1] 1 1 1
Three lines with one item each, right?
R> scan("my.txt", what="", sep="\t", quote="")
Read 2 items
[1] "犹\x83犧癌ケ\x88 犧\x84犧」犧ア犧\x9a\n犹\x83犧癌ケ\x88 犧\x84犹謂クー"
[2] "犹\x83犧癌ケ\x88 犧\x84犧」犧ア犧\x9a\n"
Two items. Note that the "quote" argument to "count.fields" and "scan" is the same; only "sep" differs ('\n' vs. "\t").
There is a newline within the first item ("\n").
R> read.table("my.txt", encoding="UTF-8", header=F, sep="\t", quote="")
[1] V1
<0 rows> (or 0-length row.names)
Zero items.
I just reduced my file to zero.
Needless to say, my editors show 3 lines (Vim, Em, Hidemaru)
Hex dump shows the newline chars clearly (see below).
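The same byte-level check can be done from inside R, without going through any locale-dependent decoding. The following is only a minimal sketch: it uses a small stand-in file (two UTF-8 lines built from raw bytes, not the real my.txt), and the helper name `split_lines` is hypothetical. It reads the file as raw bytes, splits at LF, drops a trailing CR, and marks the results as UTF-8.

```r
# Stand-in file (not the real my.txt): two UTF-8 lines ending in CR+LF.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xe0, 0xb9, 0x83, 0x0d, 0x0a,
                  0xe0, 0xb8, 0x84, 0x0d, 0x0a)), path)

# Read the whole file as bytes; nothing is decoded at this point.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# Hypothetical helper: split a raw vector into lines at LF, dropping a
# trailing CR. Bytes after the last LF (an unterminated line) are ignored.
split_lines <- function(b) {
  lf <- which(b == as.raw(0x0a))
  starts <- c(1, lf + 1)[seq_along(lf)]
  mapply(function(s, e) {
    line <- if (e >= s) b[s:e] else raw(0)
    if (length(line) > 0 && line[length(line)] == as.raw(0x0d)) {
      line <- line[-length(line)]
    }
    out <- rawToChar(line)
    Encoding(out) <- "UTF-8"   # mark only, the bytes are not converted
    out
  }, starts, lf - 1, USE.NAMES = FALSE)
}

txt <- split_lines(bytes)
length(txt)                    # number of LF-terminated lines
nchar(txt, type = "bytes")     # bytes per line, excluding CR+LF
```

Whether the strings then print readably still depends on the console and locale; the sketch only shows that the line count can be recovered from the bytes.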
I have seen related questions but not the solution:
e.g.
http://n4.nabble.com/problem-with-scan-recognizing-newline-n-td896114.html#a896115
And just for fun :
R> read.table("my.txt", encoding="justkidding")
[1] V1
<0 rows> (or 0-length row.names)
It's funny NOT to see any complaints about the "justkidding" encoding...
(it is so not R-ish :-)
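One possible reason for the silence: the help pages for "read.table" and "readLines" describe the "encoding" argument as a way to mark the strings that were read as Latin-1 or UTF-8, not as a request to re-encode the input, so an unrecognised value may simply have no effect. The sketch below only illustrates that marking behaviour with "readLines" on a small stand-in file (two UTF-8 lines built from raw bytes, not the real my.txt). Whether it helps in a Japanese_Japan.932 locale is an open assumption.

```r
# Stand-in file: two short UTF-8 lines ending in CR+LF, written as raw bytes
# so the example does not depend on the encoding of this script.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xe0, 0xb9, 0x83, 0x0d, 0x0a,
                  0xe0, 0xb8, 0x84, 0x0d, 0x0a)), path)

# `encoding` marks the strings that are returned; it does not convert bytes.
marked   <- readLines(path, encoding = "UTF-8")
unmarked <- readLines(path)

length(marked)        # number of lines read
Encoding(marked)      # declared encoding with the argument
Encoding(unmarked)    # declared encoding without it
```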
We are using R 2.8 and are stuck with it for a while.
(I briefly installed R 2.10, but that did not seem to overcome the problem.)
Any kind of help is greatly appreciated.
Best,
andzsin
ps : replacing Thai with Japanese text (same UTF-8) had slightly different results
(only some of the newlines were ignored)
******** Details: *************
name : my.txt
lg : Thai
enc : UTF-8
EOL : CR+LF (0d0a)
content :
ใช่ ครับ
ใช่ ค่ะ
ใช่ ครับ
[EOF]
HEX : <copy-paste to some prg that goes with fixed-width chars>
00000000 e0b9 83e0 b88a e0b9 8820 e0b8 84e0 b8a3 `9.`8.`9. `8.`8#
00000010 e0b8 b1e0 b89a 0d0a e0b9 83e0 b88a e0b9 `81`8...`9.`8.`9
                        ^^^^
00000020 8820 e0b8 84e0 b988 e0b8 b00d 0ae0 b983 . `8.`9.`80..`9.
                                    ^^^^^
00000030 e0b8 8ae0 b988 20e0 b884 e0b8 a3e0 b8b1 `8.`9. `8.`8#`81
00000040 e0b8 9a0d 0a `8...
                ^^^^^
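To reproduce this without passing the Thai text through a mail client, the file can be rebuilt from the hex dump above. This is a minimal sketch: the byte string is transcribed from the dump, while the helper name `hex_to_raw` and the temporary file standing in for my.txt are illustrative choices, not part of the original session.

```r
# Bytes transcribed from the hex dump: three UTF-8 lines, each ending in CR+LF.
hex <- paste(
  "e0b983e0b88ae0b98820e0b884e0b8a3e0b8b1e0b89a0d0a",
  "e0b983e0b88ae0b98820e0b884e0b988e0b8b00d0a",
  "e0b983e0b88ae0b98820e0b884e0b8a3e0b8b1e0b89a0d0a",
  sep = ""
)

# Hypothetical helper: turn a hex string into a raw vector, two digits per byte.
hex_to_raw <- function(h) {
  starts <- seq(1, nchar(h), by = 2)
  as.raw(strtoi(substring(h, starts, starts + 1), base = 16L))
}

bytes <- hex_to_raw(hex)
path <- tempfile(fileext = ".txt")   # stands in for my.txt
writeBin(bytes, path)

length(bytes)                        # file size in bytes
n_lf <- sum(bytes == as.raw(0x0a))   # LF bytes, counted without decoding
n_lf

# The calls from the post can then be pointed at the rebuilt file.
count.fields(path, sep = "\n", quote = "")
```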
R>sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
locale:
LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932
--
View this message in context: http://n4.nabble.com/read-table-and-scan-skips-newlines-which-count-fields-finds-in-Thai-textfile-tp1460736p1460736.html
Sent from the R help mailing list archive at Nabble.com.