[R] Symbol/String comparison in R
Kristjan Kure
kr|@tj@n@kure@1 @end|ng |rom gm@||@com
Thu Apr 14 23:09:20 CEST 2022
Thank you, Rui. Not sure I got everything right, but here it is:
current_loc <- Sys.getlocale("LC_COLLATE")
#> [1] "Estonian_Estonia.1257"
"A" < "a"
#41 < 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be TRUE (41 is less than 61)
"A" > "a"
#41 > 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be FALSE (41 is not bigger than 61)
Sys.setlocale("LC_COLLATE", locale = "C")
"A" < "a"
#41 < 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# OK - (41 is less than 61)
"A" > "a"
#41 > 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# OK - (41 is not bigger than 61)
Sys.setlocale("LC_COLLATE", current_loc)
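
As a side check (only a sketch, using base R): charToRaw() prints the bytes in hexadecimal, so the 41 and 61 in the comments above are 0x41 = 65 and 0x61 = 97 in decimal. Comparing those values as integers does not involve LC_COLLATE, so the result is the same under any locale. The variable names byte_A and byte_a are made up for this illustration.

```r
# Byte values of the two letters, converted from raw (hex) to integer
byte_A <- as.integer(charToRaw("A"))   # 0x41
byte_a <- as.integer(charToRaw("a"))   # 0x61

c(byte_A, byte_a)
#> [1] 65 97

# Integer comparison: LC_COLLATE is not consulted here
byte_A < byte_a
#> [1] TRUE
```

So the raw values themselves are stable; it is only "A" < "a" on the strings that changes with the collation setting.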
Conclusion: string comparison follows the charToRaw() byte values only
when LC_COLLATE is set to "C"?
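
If that is right, one way to get byte-order comparisons without leaving the session in the "C" locale might look like the sketch below. It assumes base R only; the function name less_than_c is hypothetical, not something from Rui's reply. The last line relies on the documented behaviour that sort() with method = "radix" collates in the C locale.

```r
# Hypothetical helper: compare two strings under the "C" collation,
# then restore whatever LC_COLLATE was active before the call.
less_than_c <- function(x, y) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  x < y
}

less_than_c("A", "a")
#> [1] TRUE
less_than_c("1040", "12000")   # decided at the second character: "0" vs "2"
#> [1] TRUE
less_than_c("1040", "10000")   # decided at the third character: "4" vs "0"
#> [1] FALSE

# Radix sort collates in the C locale regardless of LC_COLLATE
sort(c("b", "a", "B", "A"), method = "radix")
#> [1] "A" "B" "a" "b"
```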
Regards,
Kristjan
On Thu, Apr 14, 2022 at 10:01 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> Hello,
>
> 1) The best I could find on lower case/upper case is [1];
> The Wikipedia page you link to is about a code page and the collating
> sequence is the same as ASCII so no, that's not it.
>
> 2) In the cp1252 table "A" < "a", it follows the numeric order 0x41 <
> 0x61. But what R is using is the locale's LC_COLLATE setting, not the "C"
> one.
>
> How to validate the end results? The best way is to check the current
> setting, with Sys.getlocale.
>
>
>
> [1]
>
> https://books.google.pt/books?id=GkajBQAAQBAJ&pg=PA259&lpg=PA259&dq=collating+sequence+portuguese&source=bl&ots=fVnUYHz0ev&sig=ACfU3U3xjpJfPNcWEfvwb_2nScYb89CeOw&hl=pt-PT&sa=X&ved=2ahUKEwiAoNTW-JP3AhVI1xoKHXT-C4oQ6AF6BAgUEAM#v=onepage&q=collating%20sequence%20portuguese&f=false
>
>
> Hope this helps,
>
> Rui Barradas
>
> At 16:33 on 14/04/2022, Kristjan Kure wrote:
> > Hi Rui
> >
> > Thank you for the code snippet.
> >
> > 1) How do you find your "Portuguese_Portugal.1252" symbols table now?
> > Is it this: https://en.wikipedia.org/wiki/Windows-1252 ?
> >
> > 2) What attributes and values do you check to validate the end result?
> > I see there is a section "Codepage layout" and I can find "A" and "a"
> > symbols.
> >
> > What values on that table tell you "A" is bigger than "a"?
> > "A" < "a" # returns FALSE
> > "A" > "a" # returns TRUE
> >
> > PS! My locale is Estonian_Estonia.1257
> >
> > Regards,
> > Kristjan
> >
> > On Thu, Apr 14, 2022 at 5:05 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> > Hello,
> >
> > This is a locale issue, you are counting on the ASCII table codes but
> > that's only valid for the "C" locale.
> >
> > old_loc <- Sys.getlocale("LC_COLLATE")
> >
> > "A" < "a"
> > #> [1] FALSE
> > "A" > "a"
> > #> [1] TRUE
> >
> > Sys.setlocale("LC_COLLATE", locale = "C")
> > #> [1] "C"
> >
> > "A" < "a"
> > #> [1] TRUE
> > "A" > "a"
> > #> [1] FALSE
> >
> > Sys.setlocale("LC_COLLATE", old_loc)
> > #> [1] "Portuguese_Portugal.1252"
> >
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> > At 15:06 on 13/04/2022, Kristjan Kure wrote:
> > > Hi!
> > >
> > > Sorry, I am a beginner in R.
> > >
> > > I was not able to find answers to my questions (tried Google, Stack
> > > Overflow, etc). Please correct me if anything is wrong here.
> > >
> > > When comparing symbols/strings in R - raw numeric values are compared
> > > symbol by symbol starting from left? If raw numeric values are not used is
> > > there an ASCII / Unicode table where symbols have values/ranking/order and
> > > R compares those values?
> > >
> > > *2) Comparing symbols*
> > > Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
> > >
> > > # Raw value for "a" = 61
> > > a_raw <- charToRaw("a")
> > > a_raw
> > >
> > > # Raw value for "b" = 62
> > > b_raw <- charToRaw("b")
> > > b_raw
> > >
> > > # equals TRUE
> > > "a" < "b"
> > >
> > > Ok, so 61 is less than 62 so it's TRUE. Is this correct?
> > >
> > > *3) Comparing strings #1*
> > > "1040" <= "12000"
> > >
> > > raw_1040 <- charToRaw("1040")
> > > raw_1040
> > > #31 *30* (comparison happens with the second symbol) 34 30
> > >
> > > raw_12000 <- charToRaw("12000")
> > > raw_12000
> > > #31 *32* (comparison happens with the second symbol) 30 30 30
> > >
> > > The symbol in the second position is 30 and it's less than 32. Equals to
> > > true. Is this correct?
> > >
> > > *4) Comparing strings #2*
> > > "1040" <= "10000"
> > >
> > > raw_1040 <- charToRaw("1040")
> > > raw_1040
> > > #31 30 *34* (comparison happens with third symbol) 30
> > >
> > > raw_10000 <- charToRaw("10000")
> > > raw_10000
> > > #31 30 *30* (comparison happens with third symbol) 30 30
> > >
> > > The symbol in the third position is 34, which is greater than 30. Equals
> > > to false. Is this correct?
> > >
> > > *5) Problem - Why does this equal FALSE?*
> > > *"A" < "a"*
> > >
> > > 41 < 61 # FALSE?
> > >
> > > # Raw value for "A" = 41
> > > A_raw <- charToRaw("A")
> > > A_raw
> > >
> > > # Raw value for "a" = 61
> > > a_raw <- charToRaw("a")
> > > a_raw
> > >
> > > Why is capitalized "A" not less than lowercase "a"? Based on raw values it
> > > should be. What am I missing here?
> > >
> > > Thanks
> > > Kristjan
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
>