[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Tomáš Bořil
bor||t @end|ng |rom gm@||@com
Wed Apr 10 10:22:04 CEST 2019
Hello,
There is a long-standing problem with processing UTF-8 source code in R
on Windows. Because Windows does not have a "UTF-8" locale and R passes
source code through the OS before executing it, some characters are
"simplified" by the OS before processing, leading to undesirable
changes.
Minimal example:
Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in the RGui console:
> "ř"
[1] "r"
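One way to confirm which character a string really holds, independent of how the console displays it, is to look at its code points. The following is only a minimal sketch: the names `expected` and `is_simplified` are illustrative, and the reference character is built from a `\u` escape so that the snippet itself stays pure ASCII.

```r
# Build the reference character from an escape, so the source text is ASCII only.
expected <- "\u0159"   # LATIN SMALL LETTER R WITH CARON

# utf8ToInt() returns the Unicode code point(s) of a UTF-8 string.
code_point <- utf8ToInt(expected)
print(code_point)      # 345, i.e. 0x159

# Illustrative helper: TRUE if a string has collapsed to a plain ASCII "r".
is_simplified <- function(x) {
  identical(utf8ToInt(enc2utf8(x)), utf8ToInt("r"))
}

is_simplified("r")        # TRUE
is_simplified(expected)   # FALSE
```

In an affected session, passing the typed literal "ř" to `is_simplified()` would presumably return TRUE, matching the console output above.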
Let's assume the following script:
# file [script.R]
if ("ř" != "\U00159") {
  stop("Problem: Unexpected character conversion.")
} else {
  cat("o.k.\n")
}
Problem:
source("script.R", encoding = "UTF-8")
OK (see https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
eval(parse("script.R", encoding = "UTF-8"))
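The eval(parse(...)) workaround can be wrapped in a small helper. This is a sketch only: `source_utf8` is a hypothetical name, not a base R function, and the temporary script below merely stands in for script.R.

```r
# Hypothetical helper (not part of base R): parse the file as UTF-8 and
# evaluate the resulting expressions in the given environment.
source_utf8 <- function(file, envir = parent.frame()) {
  exprs <- parse(file, encoding = "UTF-8")
  invisible(eval(exprs, envir = envir))
}

# Throwaway script containing a literal "ř" next to its \U escape.
# The literal is produced from an escape here so this snippet stays ASCII.
tmp <- tempfile(fileext = ".R")
writeLines(enc2utf8('same <- ("\u0159" == "\\U00159")'), tmp, useBytes = TRUE)

env <- new.env()
source_utf8(tmp, envir = env)
print(env$same)   # TRUE when the literal survived parsing unchanged
unlink(tmp)
```

Evaluating into a separate environment keeps the check from overwriting objects in the workspace; for ordinary use the default `envir` behaves like a plain eval() at the call site.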
Although the script is in UTF-8, the characters are replaced by
"simplified" substitutes uncontrollably (depending on the OS locale). The
same happens when the code statements are simply entered in the R console.
The problem does not occur on operating systems with a UTF-8 locale
(macOS, Linux, ...).
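Whether a given session runs in a UTF-8 locale can be checked from R itself. A minimal sketch using l10n_info() and Sys.getlocale(); the wording of the messages is illustrative, and the advice to fall back on \u escapes is an assumption based on the behaviour described above rather than a confirmed fix.

```r
# l10n_info() reports properties of the current locale, including
# whether it is a UTF-8 locale.
info <- l10n_info()

if (isTRUE(info[["UTF-8"]])) {
  message("UTF-8 locale: non-ASCII literals should be kept as typed.")
} else {
  # Assumption: ASCII-only \u escapes avoid the locale-dependent conversion.
  message("Non-UTF-8 locale (", Sys.getlocale("LC_CTYPE"),
          "): consider \\u escapes for non-ASCII literals.")
}
```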
Best regards
Tomas Boril
> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         alpha
major          3
minor          6.0
year           2019
month          04
day            07
svn rev        76333
language       R
version.string R version 3.6.0 alpha (2019-04-07 r76333)
nickname
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"