[Rd] Windows iconv() "failure" in certain locales

Martin Maechler maechler at stat.math.ethz.ch
Thu Jun 29 12:27:51 CEST 2017


>>>>> Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>>>>     on Wed, 28 Jun 2017 18:45:59 +0200 writes:

    > On 27.06.2017 17:36, Martin Maechler wrote:
    >> This is a continuation of the R-devel thread with subject
    >> "suggestion to fix packageDescription() for Windows users" :
    >> 
    >> As I said there, a patch should address the underlying
    >> problem in packageDescription() rather than be a kludgy workaround
    >> patch for  citation().
    >> (For that same reason, Ben Marwick proposed to fix
    >> packageDescription() rather than the symptom seen in citation().)
    >> 
    >> It's not hard to see that the problem is that  iconv() in
    >> Windows does not always succeed in translating from "UTF-8" to the
    >> "current locale", in the case mentioned there.
    >> 
    >> I'm giving some easier reproducible examples:  no need to install
    >> half of tidyverse just to get citation("readr") :
    >> 
    >>> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
    >>> Encoding(x1) <- "latin1"
    >>> xU <- iconv(x1, "latin1", "UTF-8")
    >> 
    >>> Sys.setlocale("LC_CTYPE", "Chinese")
    >> [1] "Chinese (Simplified)_People's Republic of China.936"
    >>> 
    >>> iconv(x1, "latin1", "") # NA NA NA
    >> [1] NA NA NA
    >>> iconv(xU, "UTF-8", "") # NA NA NA
    >> [1] NA NA NA
    >>> iconv(xU, "UTF-8", "//TRANSLIT")
    >> [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

    > Interesting, I get Chinese characters here.

For which of the above cases?  Can you show them?
 (They may survive the e-mail servers; we had other
  Chinese R strings on R-help / R-devel recently, right?)

In any case, I think that is even worse, isn't it?
Even in a Chinese locale you'd want explicitly-latin1 text to
show up as something that looks like latin-1 (I know from a master's
 student that Windows+Chinese can well show latin-1-like
 letters interspersed in the Chinese text),
no?


    > Beside the comments from Duncan Murdoch:

    > iconv(x1, "latin1", "", sub="?")
    > etc. would be an alternative in case some characters really cannot be 
    > converted into the target encoding and should perhaps be considered for 
    > the time after Duncan commits the fix for the underlying problem.

Yes, I'd had the same idea, which is why I used it in the code I
sent along.
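
To make the effect of 'sub' concrete, here is a minimal sketch.  It
uses "ASCII" as the target instead of the Chinese locale so that it
can be tried without calling Sys.setlocale(); that choice of target is
an assumption made for illustration and is not the case reported
in this thread.

```r
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"

## Without 'sub', an element that cannot be converted typically becomes NA
strict <- iconv(x1, "latin1", "ASCII")

## With 'sub', each non-convertible input byte is replaced by the
## given string, so the result contains no NA
subbed <- iconv(x1, "latin1", "ASCII", sub = "?")
print(subbed)
```

As every non-ASCII letter here is a single byte in latin1, each one is
replaced by exactly one "?".  With UTF-8 input the substitution is also
done byte by byte, so a two-byte letter would give two "?".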

So,

1)  we definitely won't commit the workaround patch for citation().

2) I have a "workaround patch" for packageDescription() which is
   more useful in the sense that it tries alternatives only if
   iconv() produces NAs, notably  "//TRANSLIT",  "ASCII//TRANSLIT"
   (the latter Ben also mentioned, but my patch would only use it
    in the NA case) and also the same  'sub="?"' that you mention
    above, Uwe.

   That patch is not Windows-specific and will automatically
   also help in other cases / platforms where the iconv()
   re-encoding leads to partial NAs.
   
  @Duncan M: would you _not_ want me to commit that either?
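
For illustration only, the fallback order described in 2) could be
sketched roughly as below.  This is not the patch itself: the function
name iconv_fallback(), its arguments and the tryCatch() wrapper are
hypothetical and only meant to show the "retry just the NA elements"
idea.

```r
## Hypothetical helper (NOT the actual patch): re-encode 'x' and try
## the alternatives only for those elements where iconv() gave NA.
iconv_fallback <- function(x, from, to = "") {
    ## iconv() signals an error for an unsupported conversion;
    ## treat that like an all-NA result so the next alternative is tried
    try_iconv <- function(x, to, ...)
        tryCatch(iconv(x, from, to, ...),
                 error = function(e) rep(NA_character_, length(x)))
    res <- try_iconv(x, to)
    for (alt in c(paste0(to, "//TRANSLIT"), "ASCII//TRANSLIT")) {
        bad <- is.na(res) & !is.na(x)   # failed, but input was not NA
        if (!any(bad)) return(res)
        res[bad] <- try_iconv(x[bad], alt)
    }
    ## last resort: substitute the non-convertible bytes by "?"
    bad <- is.na(res) & !is.na(x)
    if (any(bad)) res[bad] <- try_iconv(x[bad], to, sub = "?")
    res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
y <- iconv_fallback(c(x1, NA), "latin1", "ASCII")
```

Elements that convert cleanly in the first step are never touched
again, and a genuine NA in the input stays NA.  What the "//TRANSLIT"
steps return depends on the platform's iconv implementation, so only
"no NA for non-NA input" should be relied upon.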

Martin


