[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Wed Apr 10 13:10:12 CEST 2019

On 4/10/19 10:22 AM, Tomáš Bořil wrote:
> Hello,
>
> There is a long-lasting problem with processing UTF-8 source code in R
> on Windows OS. As Windows does not have a "UTF-8" locale and R passes
> source code through OS before executing it, some characters are
> "simplified" by the OS before processing, leading to undesirable
> changes.
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Let's assume the following script:
> # file [script.R]
> if ("ř" != "\U00159") {
>      stop("Problem: Unexpected character conversion.")
> } else {
>      cat("o.k.\n")
> }
>
> Problem:
> source("script.R", encoding = "UTF-8")
>
> OK (see https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
> eval(parse("script.R", encoding = "UTF-8"))

On my system with your example,

>  source("t.r")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
>  source("/Users/tomas/t.r", encoding="UTF-8")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
>  eval(parse("t.r", encoding="UTF-8"))
o.k.

This is expected, unfortunately. As documented in ?source, the
"encoding" argument tells source() that the input is in UTF-8, so that
source() can convert it to the native encoding. Again as documented,
parse() uses its "encoding" argument only to mark the encoding of the
strings; it does not re-encode, so the character strings in the parsed
result carry the encoding mark (UTF-8 in this case).
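
The marking behaviour of parse() described above can be checked with a
small sketch. It assumes a temporary file written byte-for-byte as UTF-8
(the file name and the variable names are only illustrative), so that
nothing depends on the session's locale:

```r
# Build a one-line UTF-8 script from raw bytes:  x <- "ř"
# (ř is U+0159, which is the two bytes C5 99 in UTF-8)
bytes <- c(charToRaw('x <- "'), as.raw(c(0xc5, 0x99)), charToRaw('"\n'))
tmp <- tempfile(fileext = ".R")
writeBin(bytes, tmp)

# parse() only marks the strings as UTF-8; it does not re-encode them
exprs <- parse(tmp, encoding = "UTF-8")
s <- exprs[[1]][[3]]    # the string constant on the right-hand side of `<-`

Encoding(s)             # should report the UTF-8 mark
utf8ToInt(s)            # code point of the parsed character; 345 is U+0159
```
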
> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)

Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but doing so
could break existing applications that depend on it, and it would not
really help because such characters cannot be represented in the native
encoding anyway. You can find more in ?Encoding. It is a known problem
frequently encountered by users, and unless Windows starts supporting
UTF-8 as a native encoding, there is no easy fix (a Windows 10 Insider
Preview build supports it, so the situation may not be completely
hopeless). In theory you can carefully read the documentation and use
only functions that work with UTF-8 without converting to the native
encoding, but pragmatically, if you want to work with UTF-8 files in R,
it is best to use a non-Windows platform.
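
For those who have to stay on Windows, two workarounds that stay within
the "avoid conversion to native encoding" approach can be sketched as
follows. This is only a sketch: source_utf8 is a hypothetical helper
name, not a function provided by R, and it simply wraps the
eval(parse()) call quoted earlier in this thread.

```r
# Hypothetical helper (not part of base R): evaluate a UTF-8 script without
# letting source() re-encode it to the native encoding first.
source_utf8 <- function(path, envir = parent.frame()) {
  exprs <- parse(path, encoding = "UTF-8")  # marks strings as UTF-8, no re-encoding
  invisible(eval(exprs, envir = envir))
}

# ASCII-only spelling of the same character: a \u escape does not depend on
# how the file or the console input is re-encoded.
r_caron <- "\u0159"
Encoding(r_caron)    # "UTF-8"
utf8ToInt(r_caron)   # 345, i.e. U+0159
```

The escape form keeps the script itself pure ASCII, so it also survives
source() and typing at the console, at the cost of readability.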

Best
Tomas

>
> Best regards
> Tomas Boril
>
>> R.version
>                 _
> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status         alpha
> major          3
> minor          6.0
> year           2019
> month          04
> day            07
> svn rev        76333
> language       R
> version.string R version 3.6.0 alpha (2019-04-07 r76333)
> nickname
>
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
