[R] Do grep() and strsplit() use different regex engines?
David Winsemius
dwinsemius at comcast.net
Sat Jul 11 20:05:12 CEST 2015
On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
> I noticed the following:
>
>> strsplit("red green","\\b")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:
[[1]]
[1] "red" " " "green"
There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"
I got the expected result with only one of "\\>" and "\\<":
> strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"
> strsplit("red green","\\>")
[[1]]
[1] "red" " green"
The result with "\\<" seems decidedly unexpected.
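A guess at what is going on, based on my reading of the algorithm described in ?strsplit and not on the C source: the string is searched repeatedly, the text to the left of each match becomes a token, and the token plus the match are removed before searching again. If a zero-length match at the very start of the remaining string is handled by peeling off a single character (that part is my assumption), the character-by-character results would follow. The helper split_sketch() below is hypothetical and not part of R; it only mimics that loop with regexpr():

```r
# Hypothetical helper (not part of R): mimics the loop I assume strsplit() uses.
split_sketch <- function(x, pattern, ...) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, ...)
    start <- as.integer(m)            # 1-based start of the first match, -1 if none
    len <- attr(m, "match.length")
    if (start == -1) {                # no match left: the rest is the last token
      out <- c(out, x)
      break
    }
    if (start == 1 && len == 0) {
      # Assumption: a zero-length match at the start yields one character.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    } else {
      # The token is the text left of the match; drop the token and the match.
      out <- c(out, substr(x, 1, start - 1))
      x <- substr(x, start + len, nchar(x))
    }
  }
  out
}

by_boundary <- split_sketch("red green", "\\b", perl = TRUE)
by_nonword  <- split_sketch("red green", "\\W")
print(by_boundary)
print(by_nonword)
```

If the assumption holds, any remaining string that begins with a word character has a boundary at position 1, so "\\b" and "\\<" peel off one character per pass, whereas "\\>" cannot match until after "red".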
I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
=======
An assertion-character can be any of the following:
• < – Beginning of word
• > – End of word
• b – Word boundary
• B – Non-word boundary
• d – Digit character (equivalent to [[:digit:]])
• D – Non-digit character (equivalent to [^[:digit:]])
• s – Space character (equivalent to [[:space:]])
• S – Non-space character (equivalent to [^[:space:]])
• w – Word character (equivalent to [[:alnum:]_])
• W – Non-word character (equivalent to [^[:alnum:]_])
========
The word "word" appears nowhere else on that page.
>> strsplit("red green","\\W")
> [[1]]
> [1] "red" "green"
`\W` matches a single non-word character, so the " " character is consumed as the separator and discarded.
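A small check of that, as a sketch only: because each `\W` match is removed, two adjacent separators should leave an empty string between them, while `\W+` treats the whole run as one separator. The input with two spaces is my own made-up example, not something from the original question.

```r
x <- "red  green"                  # two spaces between the words (made-up input)
one  <- strsplit(x, "\\W")[[1]]    # each non-word character is its own separator
runs <- strsplit(x, "\\W+")[[1]]   # a run of non-word characters is one separator
print(one)
print(runs)
```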
>
> I would have thought that "\\b" should give what "\\W" did. Note that:
>
>> grep("\\bred\\b","red green")
> [1] 1
> ## as expected
>
> Does strsplit use a different regex engine than grep()? Or more
> likely, what am I misunderstanding?
>
> Thanks.
>
> Bert
>
>
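If the goal is the three pieces I expected above, one workaround that avoids zero-length matches altogether is to extract alternating runs of word and non-word characters instead of splitting. This is only a sketch; the bracket classes are taken from the equivalences listed above for `\w` and `\W`.

```r
x <- "red green"
# Match either a run of word characters or a run of non-word characters.
pieces <- regmatches(x, gregexpr("[[:alnum:]_]+|[^[:alnum:]_]+", x))[[1]]
print(pieces)
# grepl() still finds "red" delimited by word boundaries in the same string.
found <- grepl("\\bred\\b", x)
print(found)
```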
David Winsemius
Alameda, CA, USA