[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Wed Apr 10 18:46:20 CEST 2019

On 10/04/2019 12:32 p.m., Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>
>> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
>>> Since it is "technically easy" to disable the best fit conversion and
>>> the best fit is rarely good, how about providing an option for
>>> code/package authors to disable it? I'm asking because this is one of
>>> the most painful issues in packages that may need to source() code
>>> containing UTF-8 characters that are not representable in the Windows
>>> native encoding. Examples include knitr/rmarkdown and shiny. Basically
>>> users won't be able to knit documents or run Shiny apps correctly when
>>> the code contains characters that cannot be represented in the native
>>> encoding.
>>
>> Wouldn't things be worse with it disabled than currently?  I'd expect
>> the line containing the "ř" to end up as NA instead of converting to "r".
> 
> I don't think it would be worse, because in this case R would not
> implicitly convert strings to (best fit) latin1 on Windows, but
> instead keep the (correct) string in its UTF-8 encoding. The NA only
> appears if the user explicitly forces a conversion to latin1, which I
> think is not the problem here.
> 
> The original problem, which I can reproduce in RGui, is that if you
> enter "ř" in RGui, R opportunistically converts it to latin1 because it
> can. However, if you enter text that definitely cannot be represented
> in latin1, R encodes the string correctly as UTF-8.
> 

I think the pathways for text in RGui and text being sourced are 
different.  I agree fixing RGui in that way would make sense, but Yihui 
was talking about source().
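
For reference, here is a minimal sketch of the distinction drawn in the quoted text between a string that stays in UTF-8 and one that is explicitly converted to latin1. The names x and y are illustrative only, and the "\u0159" escape is used so that the example does not depend on how the file or console input is encoded. It does not reproduce the Windows best-fit substitution; it only shows the explicit conversion, whose result I would expect to be NA but which may depend on the platform's iconv implementation.

```r
# "\u0159" is the Unicode escape for "ř"; R marks the resulting string as UTF-8.
x <- "\u0159"
print(Encoding(x))    # declared encoding of the string
print(utf8ToInt(x))   # code point of the single character

# Explicit conversion to latin1, which has no "ř". This is the step that
# can yield NA; nothing here happens implicitly.
y <- iconv(x, from = "UTF-8", to = "latin1")
print(is.na(y))
```

Comparing print(is.na(y)) here with what source() or RGui produces for the same character on Windows should show whether a conversion happened implicitly along the way.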

Duncan Murdoch