[Rd] Windows iconv() "failure" in certain locales
Martin Maechler
maechler at stat.math.ethz.ch
Thu Jun 29 12:27:51 CEST 2017
>>>>> Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>>>> on Wed, 28 Jun 2017 18:45:59 +0200 writes:
> On 27.06.2017 17:36, Martin Maechler wrote:
>> This is a continuation of the R-devel thread with subject
>> "suggestion to fix packageDescription() for Windows users" :
>>
>> As I said there, a patch should address the underlying
>> problem in packageDescription() rather than be a kludgy workaround
>> patch for citation().
>> (For that same reason, Ben Marwick proposed to fix
>> packageDescription() rather than the symptom seen in citation().)
>>
>> It's not hard to see that the problem is that iconv() in
>> Windows does not always succeed in translating from "UTF-8" to the
>> "current locale", in the case mentioned there.
>>
>> I'm giving some easier reproducible examples: no need to install
>> half of tidyverse just to get citation("readr") :
>>
>>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
>>> Encoding(x1) <- "latin1"
>>> xU <- iconv(x1, "latin1", "UTF-8")
>>
>>> Sys.setlocale("LC_CTYPE", "Chinese")
>> [1] "Chinese (Simplified)_People's Republic of China.936"
>>>
>>> iconv(x1, "latin1", "") # NA NA NA
>> [1] NA NA NA
>>> iconv(xU, "UTF-8", "") # NA NA NA
>> [1] NA NA NA
>>> iconv(xU, "UTF-8", "//TRANSLIT")
>> [1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"
> Interesting, I get chinese characters here.
For which of the above cases? Can you show them?
(They may survive the e-mail servers; we had other
Chinese R strings on R-help / R-devel recently, right?)
In any case, I think that is even worse, isn't it?
Even in a Chinese locale you'd want text explicitly marked as latin1
to be displayed as something that looks like latin-1 (I know from a master's
student that Windows+Chinese can well show latin-1-like
letters interspersed in the Chinese text),
no?
> Beside the comments from Duncan Murdoch:
> iconv(x1, "latin1", "", sub="?")
> etc. would be an alternative in case some characters really cannot be
> converted into the target encoding and should perhaps be considered for
> the time after Duncan commits the fix for the underlying problem.
Yes. I'd had the same idea; that's why I used it in the code I
sent along.
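As a small illustration of the sub= argument (a sketch only; it converts
to "ASCII" instead of to the current locale, which is my assumption to keep
the result independent of the platform's locale), unconvertible bytes are
substituted instead of the whole string becoming NA:

```r
## Same test strings as above, declared as latin1
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"

## Without 'sub', elements that cannot be converted are documented to become NA
r_plain <- iconv(x1, "latin1", "ASCII")

## With sub = "?", each unconvertible input byte is replaced by "?"
r_sub <- iconv(x1, "latin1", "ASCII", sub = "?")

## With sub = "byte", each unconvertible byte is shown in hex, in the form <xx>
r_byte <- iconv(x1, "latin1", "ASCII", sub = "byte")

print(r_sub)
print(r_byte)
```

The substitution loses information, of course, but it keeps the rest of the
string readable.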
So,
1) we definitely won't commit the workaround patch for citation().
2) I have a "workaround patch" for packageDescription() which is
more useful in the sense that only if iconv() produces NA's, it
tries alternatives, notably "//TRANSLIT", "ASCII//TRANSLIT"
(the latter Ben also mentioned, but my patch would only use it
in the NA case) and also the same 'sub="?"' that you mention
above, Uwe.
That patch is not Windows-specific and will automatically
also help in other cases / platforms where the iconv()
re-encoding leads to partial NAs.
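
To make the idea in 2) concrete, here is a minimal sketch of such a fallback
chain. It is not the actual patch: the function name iconv_fallback() and
its argument list are hypothetical, and the tryCatch() wrapper is my own
addition for platforms whose iconv does not know the "//TRANSLIT" suffix.

```r
## Hypothetical helper, NOT the code of the proposed patch:
## only where the plain iconv() result is NA, try alternatives in turn.
iconv_fallback <- function(x, from, to = "") {
    y <- iconv(x, from, to)
    ## Alternatives named in the text: "<to>//TRANSLIT", then "ASCII//TRANSLIT"
    alts <- unique(c(paste0(to, "//TRANSLIT"), "ASCII//TRANSLIT"))
    for (alt in alts) {
        bad <- is.na(y) & !is.na(x)   # only entries that failed so far
        if (!any(bad)) return(y)
        y[bad] <- tryCatch(iconv(x[bad], from, alt),
                           ## unsupported target on this platform: leave as NA
                           error = function(e) rep(NA_character_, sum(bad)))
    }
    ## Last resort: substitute unconvertible bytes by "?"
    bad <- is.na(y) & !is.na(x)
    if (any(bad)) y[bad] <- iconv(x[bad], from, to, sub = "?")
    y
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
res <- iconv_fallback(xU, "UTF-8", "ASCII")
print(res)  # exact transliterations are platform dependent, but no NAs
```

Entries that convert cleanly in the first step are never touched, so the
transliterated or "?"-substituted forms only appear where the result would
otherwise have been NA.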
@Duncan M: would you _not_ want me to commit that either?
Martin