[Rd] How to print UTF-8 encoded strings from a C routine to R's output?
Lixin Gong
lgong at uw.edu
Tue Sep 6 03:05:49 CEST 2016
Hi Duncan,
Thanks a lot for your quick reply pointing out the Re-encoding section that
I missed!
Before trying out R's C-level interface to iconv's encoding conversion
capabilities,
I did some quick tests with Encoding() and iconv() on Windows with Rgui and
Rterm.
After Encoding(), non-ASCII characters are fine with Rgui but still wrong
with Rterm.
After iconv(), non-ASCII characters are still misprinted in both Rgui and
Rterm.
Here is the code that I used:
(neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
(neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
Encoding(neg_inf_utf8)
Encoding(neg_inf_utf8) <- "UTF-8"
Encoding(neg_inf_utf8)
neg_inf_utf8
charToRaw(neg_inf_utf8)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
Here is what I got with Rgui:
> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-∞"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-∞"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38
>
Here is what I got with Rterm:
> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-â^z"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-8"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38
>
Here is the sessionInfo:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
>
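To check what the four bytes above actually encode, and whether code page 1252 (the LC_CTYPE shown in the locale) has a slot for the result, a small standalone C sketch can help. The names utf8_decode() and in_cp1252() are hypothetical helpers written for this illustration, not part of R's API, and the decoder is deliberately minimal (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the sequence is malformed or truncated.
   Sketch only: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                      /* 1 byte: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte lead */
        *cp = s[0] & 0x1F; len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte lead */
        *cp = s[0] & 0x0F; len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte lead */
        *cp = s[0] & 0x07; len = 4;
    } else {
        return 0;                           /* stray continuation byte */
    }

    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        *cp = (*cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return len;
}

/* Return 1 if the code point has a byte in Windows code page 1252. */
int in_cp1252(uint32_t cp)
{
    /* Code points assigned to bytes 0x80..0x9F in CP1252. */
    static const uint32_t high[] = {
        0x20AC, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6,
        0x2030, 0x0160, 0x2039, 0x0152, 0x017D, 0x2018, 0x2019, 0x201C,
        0x201D, 0x2022, 0x2013, 0x2014, 0x02DC, 0x2122, 0x0161, 0x203A,
        0x0153, 0x017E, 0x0178
    };
    size_t i;

    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
        return 1;                           /* ASCII and Latin-1 upper half */
    for (i = 0; i < sizeof high / sizeof high[0]; i++)
        if (high[i] == cp)
            return 1;
    return 0;
}
```

Fed the bytes 2d e2 88 9e, the decoder yields U+002D followed by U+221E, and in_cp1252() reports that U+221E has no byte in code page 1252. That would be consistent with the conversion to the native encoding substituting a look-alike "8", although the substitution itself is an assumption here rather than something the sketch demonstrates.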
Am I missing something obvious? Thanks a lot for your help and your time!
Michael
On Mon, Sep 5, 2016 at 3:31 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
> On 05/09/2016 12:40 AM, Lixin Gong wrote:
>
>> Dear R experts,
>>
>> According to
>> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Printing,
>> it seems that Rprintf has to be used when printing from a C routine to
>> guarantee that the text is written to R’s output.
>>
>> However if a string is UTF-8 encoded, non-ASCII characters (e.g., the
>> infinity symbol http://www.fileformat.info/info/unicode/char/221e/index.htm)
>> are misprinted.
>> Is this an unsupported feature or is there a workaround for this
>> limitation?
>>
>
> If you are working in a UTF-8 locale (as on most Unix-like systems), you
> should be fine. If not (as is normal on Windows), you'll need to translate
> the string to the local encoding. The Writing R Extensions manual section
> 6.11 tells you how to do the re-encoding.
>
> Duncan Murdoch
>
>
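The re-encoding step described in the reply above can be sketched in plain C with the POSIX iconv() functions. Inside an R package one would presumably use the Riconv_open(), Riconv() and Riconv_close() wrappers declared in R_ext/Riconv.h, which the manual documents, and hand the converted buffer to Rprintf(). The standalone version below only assumes a system iconv; reencode() is a hypothetical helper name, and the encoding names it is given must be ones the local iconv recognises.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from
   encoding `from` to encoding `to`, writing a NUL-terminated result
   into out[0..outsize-1]. Returns 0 on success, -1 on any failure
   (unknown encoding, unconvertible input, output buffer too small). */
int reencode(const char *from, const char *to,
             const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inbuf, *outbuf;
    size_t inleft, outleft, rc;

    assert(in != NULL && out != NULL && outsize > 0);

    cd = iconv_open(to, from);          /* note the argument order: to, from */
    if (cd == (iconv_t)-1)
        return -1;

    inbuf = (char *)in;                 /* POSIX iconv() takes char ** */
    inleft = strlen(in);
    outbuf = out;
    outleft = outsize - 1;              /* keep one byte for the terminator */

    rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outbuf, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outbuf = '\0';
    return 0;
}
```

A call such as reencode("UTF-8", "ISO-8859-1", s, buf, sizeof buf) covers characters that exist in the target encoding. For a character the target lacks, such as the infinity symbol in a single-byte Western code page, the call may fail or substitute another character depending on the iconv implementation, so the return value should be checked before printing.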