[R] Problem comparing two strings
Björn Fisseler
bjoern.fisseler using googlemail.com
Mon Nov 18 16:39:06 CET 2019
Thank you! That solved my problem!
Best
Björn
On 18.11.19 at 16:34, Ivan Krylov wrote:
> On Mon, 18 Nov 2019 16:11:44 +0100
> "Björn Fisseler" <bjoern.fisseler using googlemail.com> wrote:
>
>> It's obviously the umlaut "ä" in this example, which is encoded with
>> two bytes in one string and three bytes in the other. How can I change this?
> Welcome to the wonderful world of Unicode-related problems! It is,
> indeed, possible to represent the same glyph using either one
> code point (LATIN SMALL LETTER A WITH DIAERESIS) or two code points
> (LATIN SMALL LETTER A followed by COMBINING DIAERESIS). (Other
> combinations of code points resulting in the same glyph are probably
> also possible.)
>
> What you are looking for is called "Unicode normalization" and it is
> implemented in the stringi package, in functions stri_trans_nfc
> (normalization: there are multiple normal forms to choose from but W3C
> guidelines recommend NFC) and stri_compare / stri_cmp (test for
> canonical equivalence).
>
> See also: ?stringi::stri_cmp and https://stackoverflow.com/a/20684794
>
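For the archive, here is a minimal sketch of the approach described above. It assumes the stringi package is installed and the session uses UTF-8; the variable names and the example character are only illustrative and are not taken from the original data.

```r
library(stringi)

# The same glyph "ä" built two ways (escapes keep the script ASCII-only).
precomposed <- "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed  <- "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# A plain comparison treats them as different strings.
precomposed == decomposed                            # FALSE
nchar(c(precomposed, decomposed), type = "bytes")    # 2 and 3 bytes in UTF-8

# Normalizing both sides to NFC makes them compare equal.
nfc_equal <- stri_trans_nfc(precomposed) == stri_trans_nfc(decomposed)  # TRUE

# Alternatively, test canonical equivalence without changing the strings.
equiv <- stri_cmp_equiv(precomposed, decomposed)     # TRUE
```

If the strings are used as keys for merging or matching, it is probably simplest to normalize the whole column once with stri_trans_nfc() and compare afterwards.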