[R] Problem comparing two strings
peter dalgaard
pd@|gd @end|ng |rom gm@||@com
Mon Nov 18 16:48:04 CET 2019
A version of this came up not long ago in a slightly different context (bug 17369: parse() doesn't honor unicode in NFD normalization).
The basic issue is that Unicode has several normalization forms (look up NFC and NFD).
Briefly, accented characters exist in two forms: one as a single code point, the other as the base letter followed by a combining accent.
I.e., there is the single letter "ä" ("\u00e4"), and then there is "a\u0308", which is "a" followed by "combining diaeresis", which effectively puts a ¨ on top of the preceding character.
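To make the two forms concrete, here is a small base R sketch. The variable names nfc and nfd are made up for illustration; the strings are written with \u escapes so that the example does not depend on how the file itself is encoded.

```r
# Two spellings of the same visible text
nfc <- "J\u00e4ger"    # precomposed: single code point U+00E4
nfd <- "Ja\u0308ger"   # decomposed: "a" followed by U+0308 (combining diaeresis)

nfc == nfd                   # FALSE: the code point sequences differ
charToRaw(nfc)               # 4a c3 a4 67 65 72
charToRaw(nfd)               # 4a 61 cc 88 67 65 72
nchar(nfc, type = "chars")   # 5
nchar(nfd, type = "chars")   # 6
```

The byte sequences correspond to the "c3 a4" versus "61 cc 88" difference in the raw dumps quoted in the original question.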
The utf8 package has code for normalizing strings.
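A minimal sketch of what that could look like, assuming the utf8 package is installed from CRAN. utf8_normalize() converts to the composed form (NFC); string1 and string2 are stand-ins built from escapes, not the poster's actual data.

```r
# Assumes install.packages("utf8") has been run
library(utf8)

string1 <- "Alexander J\u00e4ger"    # precomposed, like the value from the CSV file
string2 <- "Alexander Ja\u0308ger"   # decomposed, like the macOS file name

string1 == string2                                   # FALSE
utf8_normalize(string1) == utf8_normalize(string2)   # TRUE
```

For a join, one would presumably normalize the key column in both data sets first and then merge on the normalized values. stringi::stri_trans_nfc() should work as an alternative.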
-pd
> On 18 Nov 2019, at 16:11 , Björn Fisseler <bjoern.fisseler using googlemail.com> wrote:
>
> Hello,
>
> I'm struggling to compare two strings that come from different data
> sets. The strings look identical: "Alexander Jäger"
>
> But when I compare these strings: string1 == string2
> the result is FALSE.
>
> Looking at the raw bytes used to encode the strings, the results are
> different:
>
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
>
> string2 comes from the file names of different files on my machine
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
>
> It's obviously the umlaut "ä" in this example, which is encoded with two
> and three bytes, respectively. The question is how to change this? This
> problem makes it impossible to join the two data sets based on the
> names. I already checked the settings on my machine: Sys.getlocale()
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8".
> Changing/forcing the encoding of the data didn't bring the results I
> expected.
>
> What else can I try?
>
> Best regards
>
> Björn
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com