[Rd] Problems with sub() due to inability to set encoding of ASCII strings
Winston Chang
winstonchang1 at gmail.com
Thu Sep 29 18:38:16 CEST 2016
I'm encountering a problem using sub() on strings in R 3.3.1 on
Windows, in both RGui and RStudio. The problem occurs when the
starting string is ASCII but the replacement string is UTF-8.
If we create an ASCII string x1, its encoding is marked as "unknown",
and calling sub() on that string with a UTF-8 replacement produces
garbled characters:
x1 <- "a b c"
Encoding(x1)
# [1] "unknown"
replacement <- "中文"
Encoding(replacement)
# [1] "UTF-8"
(y1 <- sub("a", replacement, x1))
# [1] "ä¸æ–‡ b c"
Encoding(y1)
# [1] "unknown"
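The garbled output above is consistent with the UTF-8 bytes of the replacement being displayed in a single-byte Windows code page. The sketch below reproduces that misreading on any platform. The choice of CP1252 is an assumption about the native encoding of the Windows session, and the variable names `bytes` and `misread` are only illustrative.

```r
# The replacement string, written with escapes so the script is ASCII-only.
replacement <- "\u4e2d\u6587"

# Its UTF-8 representation is six bytes: e4 b8 ad e6 96 87.
bytes <- charToRaw(enc2utf8(replacement))
print(bytes)

# Reinterpret those same bytes as CP1252 (assumed native code page)
# and convert the result to UTF-8 so it can be inspected anywhere.
misread <- iconv(list(bytes), from = "CP1252", to = "UTF-8")

# Six separate code points instead of two; one of them is U+00AD
# (soft hyphen), which is usually not displayed.
print(utf8ToInt(misread))
```

If that assumption holds, the bytes in y1 are already correct UTF-8 and only the encoding mark is missing.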
If the starting string x2 has Chinese characters, it'll be marked as
UTF-8, and replacement works fine:
x2 <- "a b c 中文"
Encoding(x2)
# [1] "UTF-8"
(y2 <- sub("a", replacement, x2))
# [1] "中文 b c 中文"
Encoding(y2)
# [1] "UTF-8"
It seems like the solution should be to mark the starting string as
UTF-8, but apparently it doesn't work if the string is ASCII, and so
the sub() still gives weird characters:
# Not possible to mark x1 as UTF-8
Encoding(x1) <- "UTF-8"
Encoding(x1)
# [1] "unknown"
(y3 <- sub("a", replacement, x1))
# [1] "ä¸æ–‡ b c"
Encoding(y3)
# [1] "unknown"
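The behaviour of Encoding<- here matches the ?Encoding help page, which states that ASCII strings are never marked with a declared encoding because their representation is the same in all supported encodings. A small sketch of the contrast, using an unmarked non-ASCII string built from raw bytes (the names `ascii` and `non_ascii` are illustrative):

```r
ascii <- "a b c"

# UTF-8 bytes of U+4E2D, created without an encoding mark.
non_ascii <- rawToChar(as.raw(c(0xe4, 0xb8, 0xad)))

Encoding(ascii) <- "UTF-8"      # silently has no effect on pure ASCII
Encoding(non_ascii) <- "UTF-8"  # marks the string as UTF-8

marks <- Encoding(c(ascii, non_ascii))
print(marks)
# [1] "unknown" "UTF-8"
```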
It is possible to tell R that the final string y3 is UTF-8, but it
doesn't seem like this should be necessary:
Encoding(y3) <- "UTF-8"
y3
# [1] "中文 b c"
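Until there is a better answer, the manual step above could be wrapped in a helper. This is only a sketch of a workaround, not a fix for sub() itself: `sub_utf8` is a hypothetical name, and it assumes the result bytes are valid UTF-8 whenever both inputs were converted to UTF-8 first.

```r
# Hypothetical wrapper: convert inputs to UTF-8, run sub(), then mark
# the result as UTF-8. ASCII-only results stay "unknown", which is harmless.
sub_utf8 <- function(pattern, replacement, x, ...) {
  out <- sub(pattern, enc2utf8(replacement), enc2utf8(x), ...)
  Encoding(out) <- "UTF-8"  # assumption: the result bytes are UTF-8
  out
}

y4 <- sub_utf8("a", "\u4e2d\u6587", "a b c")
print(Encoding(y4))
print(utf8ToInt(y4))
```

Where sub() already returns a correctly marked result, the extra marking step should change nothing.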
Is there some way to mark the starting string x1 as UTF-8 so that the
result of sub() comes out marked as UTF-8? If the inputs are both
UTF-8, it shouldn't be necessary to explicitly tell R that the output
is also UTF-8.
-Winston