[R] Non-ASCII characters in R on Windows

Tue Sep 17 14:15:36 CEST 2013

On Monday 16 September 2013 at 20:04 +0400, Maxim Linchits wrote:
> Here is that old post:
> http://r.789695.n4.nabble.com/read-csv-and-FileEncoding-in-Windows-version-of-R-2-13-0-td3567177.html
> 
> A taste: "Again, the issue is that opening this UTF-8 encoded file
> under R 2.13.0 yields an error, but opening it under R 2.12.2 works
> without any issues. (...)"
I have tried R 2.12.2, both 32-bit and 64-bit, on Windows Server 2008
with the French (CP1252) locale, and I still get an error with the
test case I provided in previous messages. So it does not appear to be
the same issue.
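
In case it helps anyone, here is a possible workaround, based on the
observation quoted below that scan() copes with the file when given
encoding="UTF-8" instead of fileEncoding="UTF-8". It is only a sketch: it
relies on read.table() passing encoding= through to scan() and on
check.names=FALSE skipping make.names(), and I have not checked it across
R versions or locales. The temporary file is just a stand-in for the
test case.

```r
# Write the test case as raw UTF-8 bytes, bypassing the native encoding
# (the temporary file is a stand-in for myFile.csv).
tmpfile <- tempfile(fileext = ".csv")
lines <- c("\"\u0531\",\"\u0532\"", "1,10", "2,20")
con <- file(tmpfile, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)
close(con)

# encoding= only marks the strings as UTF-8; it does not re-encode the
# input the way fileEncoding= does. check.names = FALSE keeps the header
# from being passed through make.names().
dat <- read.table(tmpfile, sep = ",", header = TRUE, quote = "\"",
                  encoding = "UTF-8", check.names = FALSE)
names(dat)
```

If this behaves like the scan() call does, the column names should come
back as the two Armenian letters rather than "X.U.0531.".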


Regards

> On Mon, Sep 16, 2013 at 6:38 PM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> > On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
> >> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> >> > This is a condensed version of the same question on stackexchange here:
> >> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> >> > If you've already stumbled upon it feel free to ignore.
> >> >
> >> > My problem is that R on US Windows does not read *any* text file that
> >> > contains *any* foreign characters. It simply reads the first n consecutive
> >> > ASCII characters and then throws a warning once it reaches a foreign
> >> > character:
> >> >
> >> > > test <- read.table("test.txt", sep=";", dec=",", quote="", fileEncoding="UTF-8")
> >> > Warning messages:
> >> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
> >> >   invalid input found on input connection 'test.txt'
> >> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
> >> >   incomplete final line found by readTableHeader on 'test.txt'
> >> > > print(test)
> >> >        V1
> >> > 1 english
> >> >
> >> > > Sys.getlocale()
> >> > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> >> >
> >> >
> >> > It is important to note that R on Linux will read UTF-8 as well as
> >> > exotic character sets without a problem. I've tried it with the exact same
> >> > files (one was UTF-8 and another was OEM866 Cyrillic).
> >> >
> >> > If I do not include the fileEncoding parameter, read.table will read the
> >> > whole CSV file. But naturally it will read it wrong because it does not
> >> > know the encoding. So whenever I try to specify the fileEncoding, R will
> >> > throw the warnings and stop once it reaches a foreign character. It's the
> >> > same story with all international character encodings.
> >> > Other users on stackexchange have reported exactly the same issue.
> >> >
> >> >
> >> > Is anyone here who is on a US version of Windows able to import files with
> >> > foreign characters? Please let me know.
> >> A reproducible example would have helped, as requested by the posting
> >> guide.
> >>
> >> That said, I can reproduce the problem after saving the data below
> >> to a CSV file encoded in UTF-8 (even Notepad can do this):
> >> "Ա","Բ"
> >> 1,10
> >> 2,20
> >>
> >> This is on a Windows 7 box using French locale, but same codepage 1252
> >> as yours. What is interesting is that reading the file using
> >> readLines(file("myFile.csv", encoding="UTF-8"))
> >> gives no invalid characters. So there must be a bug in read.table().
> >>
> >>
> >> But I must note that I do not experience issues with French accented
> >> characters like "é" ("\Ue9"). By contrast, reading Armenian
> >> characters like "Ա" ("\U531") gives weird results: the character appears
> >> as <U+0531> instead of Ա.
> >>
> >> Self-contained example, writing the file and reading it back from R:
> >> tmpfile <- tempfile()
> >> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
> >> readLines(file(tmpfile, encoding="UTF-8"))
> >> # "<U+0531>"
> >>
> >> The same phenomenon happens when creating a data frame from this
> >> character (as noted on StackExchange):
> >> data.frame("\U531")
> >>
> >> So my conclusion is that maybe Windows does not really support Unicode
> >> characters that are not "relevant" for your current locale. And that may
> >> have created bugs in the way R handles them in read.table(). R
> >> developers can probably tell us more about it.
> > After some more investigation, one part of the problem can be traced
> > back to scan() (with myFile.csv filled as described above):
> > scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
> > # Read 2 items
> > # [1] "Ա" "Բ"
> >
> > Equivalent, but nonsensical to me:
> > scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
> > # Read 2 items
> > # [1] "Ա" "Բ"
> >
> > scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
> > # Read 0 items
> > # character(0)
> > # Warning message:
> > # In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> > #  invalid input found on input connection 'myFile.csv'
> >
> >
> > So one part of the issue seems to lie in scan(), which for some
> > reason does not work when passed fileEncoding="UTF-8"; and another part
> > in read.table(), which transforms "Ա" ("\U531") into "X.U.0531.",
> > probably via make.names(), since:
> > make.names("\U531")
> > # "X.U.0531."
> >
> >
> > Does this make sense to R-core members?
> >
> >
> > Regards