[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Wed Apr 10 13:10:12 CEST 2019
On 4/10/19 10:22 AM, Tomáš Bořil wrote:
> Hello,
>
> There is a long-standing problem with processing UTF-8 source code in R
> on Windows. As Windows does not have a "UTF-8" locale and R passes
> source code through OS before executing it, some characters are
> "simplified" by the OS before processing, leading to undesirable
> changes.
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Let's assume the following script:
> # file [script.R]
> if ("ř" != "\U00159") {
>   stop("Problem: Unexpected character conversion.")
> } else {
>   cat("o.k.\n")
> }
>
> Problem:
> source("script.R", encoding = "UTF-8")
>
> OK (see https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
> eval(parse("script.R", encoding = "UTF-8"))
On my system with your example,
> source("t.r")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
> source("/Users/tomas/t.r", encoding="UTF-8")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
> eval(parse("t.r", encoding="UTF-8"))
o.k.
This is expected, unfortunately. As documented in ?source, the
"encoding" argument tells source() that the input is in UTF-8, so that
source() can convert it to the native encoding. Again as documented,
parse() uses its encoding argument only to mark the encoding of the
strings; it does not re-encode, so the character strings in the parsed
result carry the declared encoding mark (UTF-8 in this case).
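
To illustrate the marking behaviour of parse() described above, here is a
minimal sketch. It writes a string literal containing the two UTF-8 bytes
of "ř" (0xC5 0x99) to a temporary file and parses it. The temporary file
and the variable names are made up for this example.

```r
# Create a one-line script containing the literal "ř", written as raw UTF-8
# bytes so that the file content does not depend on the current locale.
path <- tempfile(fileext = ".R")
writeBin(c(charToRaw('"'), as.raw(c(0xC5, 0x99)), charToRaw('"\n')), path)

# parse() does not re-encode the input; it marks non-ASCII strings with
# the encoding given in the 'encoding' argument.
exprs <- parse(path, encoding = "UTF-8")
s <- exprs[[1]]               # the string constant from the file

print(Encoding(s))            # the encoding mark of the parsed string
print(utf8ToInt(s) == 0x159)  # TRUE if the character survived unchanged
```
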
> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on the OS locale). The
> same happens when code statements are simply entered in the R console.
>
> The problem does not occur on operating systems with a UTF-8 locale
> (Mac OS, Linux...)
Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but doing so
could break existing applications that depend on it, and it would not
really help because such characters cannot be represented in the native
encoding anyway. You can find more in ?Encoding, but yes, this is a known
problem frequently encountered by users, and unless Windows starts
supporting UTF-8 as a native encoding, there is no easy fix (a version
from the Windows 10 Insider preview supports it, so maybe the situation
is not completely hopeless). In theory you can carefully read the
documentation and use only functions that can work with UTF-8 without
converting to the native encoding, but pragmatically, if you want to work
with UTF-8 files in R, it is best to use a non-Windows platform.
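
For scripts that nevertheless have to run on Windows, the eval(parse())
workaround from the original report can be wrapped in a small helper. This
is only a sketch: source_utf8 is a made-up name, not a base R function,
and it offers none of the extra features of source() such as echo or
chdir. The test script is written as raw bytes so that the example does
not depend on the locale it is run in.

```r
# Hypothetical helper (not part of base R): evaluate a UTF-8 script without
# converting its contents to the native encoding first.
source_utf8 <- function(path, envir = parent.frame()) {
  exprs <- parse(path, encoding = "UTF-8")  # marks strings, no re-encoding
  invisible(eval(exprs, envir))             # evaluates the expressions in order
}

# A condensed version of the script from the report; "ř" is 0xC5 0x99 in UTF-8.
path <- tempfile(fileext = ".R")
script <- c(
  charToRaw('if ("'), as.raw(c(0xC5, 0x99)),
  charToRaw('" != "\\U00159") stop("Problem") else cat("o.k.\\n")\n')
)
writeBin(script, path)

source_utf8(path)
```

Strings created this way keep their UTF-8 mark, but any later call that
converts them to the native encoding can still substitute characters, so
this only helps as long as the rest of the code stays within functions
that work on UTF-8 directly.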
Best
Tomas
>
> Best regards
> Tomas Boril
>
>> R.version
> _
> platform x86_64-w64-mingw32
> arch x86_64
> os mingw32
> system x86_64, mingw32
> status alpha
> major 3
> minor 6.0
> year 2019
> month 04
> day 07
> svn rev 76333
> language R
> version.string R version 3.6.0 alpha (2019-04-07 r76333)
> nickname
>
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>