[R] Do grep() and strsplit() use different regex engines?
David Winsemius
dwinsemius at comcast.net
Sun Jul 12 00:31:27 CEST 2015
On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:
> David/Jeff:
>
> Thank you both.
>
> You seem to confirm that my observation of an "infelicity" in
> strsplit() is real. That is most helpful.
>
> I found nothing surprising in the code in David's second message. That
> is, the splits shown conform to what I would expect from "\\b" . But
> not to what I originally showed and David enlarged upon in his first
> message. I still don't really get why a split should occur at every
> letter.
>
> Jeff may very well have found the explanation, but I have not gone
> through his code.
>
> If the infelicities noted (are there more?) by David and me are not
> really bugs -- and I would be frankly surprised if they were -- I
> would suggest that perhaps they deserve mention in the strsplit() man
> page. Something to the effect that "\b and \< should not be used as
> split characters..." .
It's more of a regex infelicity, or what appears (to both of us, at a minimum) to be a violation of the 'principle of least surprise':
> gsub("\\b", " ", " This is a test case")
[1] " T h i s i s a t e s t c a s e "
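
A possible explanation for the split at every letter, offered as a sketch rather than a statement about the actual C code: ?strsplit says that if empty matches occur, x is split into single characters, and its Details section describes a loop that repeatedly removes the match and everything to the left of it. Assuming that an empty match at the very start of the remaining string is handled by emitting one character and searching again, "\\b" would match anew at the start of each shortened string. The helper name split_sketch below is hypothetical and not part of R. The last lines show one way to get the "red" " " "green" result without strsplit().

```r
# Hypothetical helper (not part of R): imitates the loop described in
# ?strsplit, under the assumption that an empty match at the start of the
# remaining string yields a single character as the next token.
split_sketch <- function(x, pattern, perl = FALSE) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = perl)
    start <- as.integer(m)
    if (start == -1L) {
      # No match left: the rest of the string is the final token.
      out <- c(out, x)
      break
    }
    # Index of the last character covered by the match
    # (equals start - 1 when the match is empty).
    end <- start + attr(m, "match.length") - 1L
    if (end > 0L) {
      # Keep the text left of the match, drop the match itself.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and move on.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_sketch("red green", "\\W")   # "red" "green"
split_sketch("red green", "\\b")   # compare with strsplit("red green", "\\b")
split_sketch("red green", "\\>")   # compare with strsplit("red green", "\\>")

# Alternative that avoids zero-length split patterns: extract runs of word
# and non-word characters instead of splitting.
words_and_gaps <- regmatches("red green",
                             gregexpr("\\w+|\\W+", "red green"))[[1]]
words_and_gaps                     # "red" " " "green"
```

If the sketch reproduces what strsplit() returns on your system for "\\b", "\\<" and "\\>", that would suggest the behaviour comes from how strsplit() restarts its search rather than from a different regex engine.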
--
David.
>
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> -- Clifford Stoll
>
>
> On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>>
>> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>>
>>> I noticed the following:
>>>
>>>> strsplit("red green","\\b")
>>> [[1]]
>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>
>> After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:
>>
>> [[1]]
>> [1] "red" " " "green"
>>
>> There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"
>>
>> I got the expected result with only one of "\\>" and "\\<"
>>
>>> strsplit("red green","\\<")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>
>>> strsplit("red green","\\>")
>> [[1]]
>> [1] "red" " green"
>>
>> The result with "\\<" seems decidedly unexpected.
>>
>> I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
>> =======
>> An assertion-character can be any of the following:
>>
>> • < – Beginning of word
>> • > – End of word
>> • b – Word boundary
>> • B – Non-word boundary
>> • d – Digit character (equivalent to [[:digit:]])
>> • D – Non-digit character (equivalent to [^[:digit:]])
>> • s – Space character (equivalent to [[:space:]])
>> • S – Non-space character (equivalent to [^[:space:]])
>> • w – Word character (equivalent to [[:alnum:]_])
>> • W – Non-word character (equivalent to [^[:alnum:]_])
>> ========
>>
>> The word "word" appears nowhere else on that page.
>>
>>
>>>> strsplit("red green","\\W")
>>> [[1]]
>>> [1] "red" "green"
>>
>> `\W` matches a single non-word character and, unlike the zero-width `\b`, consumes it. So the " "-character is discarded from the result.
>>
>>>
>>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>>
>>>> grep("\\bred\\b","red green")
>>> [1] 1
>>> ## as expected
>>>
>>> Does strsplit use a different regex engine than grep()? Or more
>>> likely, what am I misunderstanding?
>>>
>>> Thanks.
>>>
>>> Bert
>>>
>>>
>>
>>
>> David Winsemius
>> Alameda, CA, USA
>>
David Winsemius
Alameda, CA, USA