[R] Problem comparing two strings
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Mon Nov 18 16:45:55 CET 2019
On 18/11/2019 10:11 a.m., Björn Fisseler wrote:
> Hello,
>
> I'm struggling to compare two strings that come from different data
> sets. The strings look identical: "Alexander Jäger"
>
> But when I compare these strings: string1 == string2
> the result is FALSE.
>
> Looking at the raw bytes used to encode the strings, the results are
> different:
>
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
>
> string2 comes from the file names of different files on my machine
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
>
> It's obviously the umlaut "ä" in this example, which is encoded with two
> bytes in string1 and three bytes in string2. How can I change this? This
> problem makes it impossible to join the two data sets based on the
> names. I already checked the settings on my machine: Sys.getlocale()
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
> Changing/forcing the encoding of the data didn't bring the results I
> expected.
>
> What else can I try?
Characters like ä have two (or more) representations in Unicode: a
single code point, or the code point for "a" followed by a code point
that says "add an umlaut".
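To make the two representations concrete, here is a small base-R sketch. The variable names composed and decomposed are only illustrative; the byte values match the ones in the quoted message.

```r
# Two ways to write "ä": one precomposed code point, or "a" plus a combining mark.
composed   <- "\u00e4"    # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed <- "a\u0308"   # U+0061 "a" followed by U+0308 COMBINING DIAERESIS

composed == decomposed           # FALSE: different code point sequences
utf8ToInt(composed)              # 228
utf8ToInt(decomposed)            # 97 776
charToRaw(enc2utf8(composed))    # c3 a4
charToRaw(enc2utf8(decomposed))  # 61 cc 88
```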
If you want to compare strings, you need a consistent representation.
This is called normalizing the string.
There are several possible normalizations; for your purposes it doesn't
matter which one you use, as long as you use the same normalization for
both strings. See
<https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html>
for details.
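As a rough illustration of "any normalization works, as long as it is the same one", the sketch below uses stringi's stri_trans_nfc and stri_trans_nfd on two made-up strings (s1 and s2 are assumptions for the example, not the poster's data):

```r
library(stringi)

s1 <- "J\u00e4ger"    # precomposed "ä"
s2 <- "Ja\u0308ger"   # "a" followed by a combining diaeresis

stri_trans_nfc(s1) == stri_trans_nfc(s2)  # TRUE: both composed
stri_trans_nfd(s1) == stri_trans_nfd(s2)  # TRUE: both decomposed
stri_trans_nfc(s1) == stri_trans_nfd(s2)  # FALSE: mixed forms still differ
```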
In R, several functions can do the normalization for you. Two of them
are utf8::utf8_normalize and stringi::stri_trans_nfc. So you'd want
something like

    library(utf8)
    string1 <- utf8_normalize(string1)
    string2 <- utf8_normalize(string2)
    string1 == string2  # Should now work as expected

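Since the goal was to join two data sets on the names, the same idea can be applied to the key column before merging. This is only a sketch under assumptions: the data frames from_csv and from_files and their columns are hypothetical stand-ins for the real data, and it uses stringi::stri_trans_nfc rather than utf8_normalize.

```r
library(stringi)

# Hypothetical data: one table read from a CSV (composed "ä"),
# one built from macOS file names (decomposed "ä").
from_csv   <- data.frame(name = "Alexander J\u00e4ger",  score = 1,
                         stringsAsFactors = FALSE)
from_files <- data.frame(name = "Alexander Ja\u0308ger", file = "a.pdf",
                         stringsAsFactors = FALSE)

# Normalize the join key to NFC in both tables before merging.
from_csv$name   <- stri_trans_nfc(from_csv$name)
from_files$name <- stri_trans_nfc(from_files$name)

joined <- merge(from_csv, from_files, by = "name")
nrow(joined)  # 1
```

Without the two normalization lines, the keys would not match and the merge would return no rows.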
Duncan Murdoch