[R] Do grep() and strsplit() use different regex engines?

Sun Jul 12 00:31:27 CEST 2015

On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:

> David/Jeff:
> 
> Thank you both.
> 
> You seem to confirm that my observation of an "infelicity" in
> strsplit() is real. That is most helpful.
> 
> I found nothing in David's message 2 code that was surprising. That
> is, the splits shown conform to what I would expect from "\\b" . But
> not to what I originally showed and David enlarged upon in his first
> message. I still don't really get why a split should occur at every
> letter.
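
One possible reading of why every letter becomes its own piece, sketched in R. It assumes (from the algorithm outlined in ?strsplit and my reading of the implementation, so treat it as unverified) that strsplit() searches the remaining string over and over, and that an empty match at the very start of that remainder is handled by peeling off a single character. The name split_sketch is made up for illustration, and perl = TRUE is used only to get predictable "\\b" semantics:

```r
# Hypothetical re-implementation of the assumed strsplit() loop; not R's actual code.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {
      # No match left: the remainder is the final piece.
      out <- c(out, x)
      break
    }
    # Number of characters up to and including the match (0 for an empty match at the start).
    end <- start - 1L + attr(m, "match.length")
    if (end > 0L) {
      # Emit what precedes the match, then drop the match and everything before it.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and move on.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_sketch("red green", "\\b")
split_sketch("red green", "\\W")
```

Under that assumption, once a letter has been removed the remainder again starts at a word boundary, so "\\b" yields one letter per pass, which is the output Bert reported. With "\\W" the match is a real character, and the sketch returns "red" "green".
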
> 
> Jeff may very well have found the explanation, but I have not gone
> through his code.
> 
> If the infelicities noted (are there more?) by David and me are not
> really bugs -- and I would be frankly surprised if they were -- I
> would suggest that perhaps they deserve mention in the strsplit() man
> page. Something to the effect that "\b and \< should not be used as
> split characters..." .

It's more of a regex infelicity, or what appears (to both of us, at a minimum) to be a violation of the 'principle of least surprise':

>  gsub("\\b", " ", "  This is a test case")
[1] "     T h i s   i s   a   t e s t   c a s e "
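
For comparison, a small sketch that asks for the PCRE engine instead of the default one. The only change is perl = TRUE, a documented argument of gsub(); I am assuming PCRE treats "\\b" as a pure zero-width assertion here, in which case a space is added at each word boundary and nowhere else:

```r
x <- "  This is a test case"
# perl = TRUE switches gsub() to PCRE for this call.
res <- gsub("\\b", " ", x, perl = TRUE)
print(res)
# Removing every space should leave the letters in their original order.
print(gsub(" ", "", res))
```

If the words come back intact, the letter-by-letter result above is specific to the default engine rather than to "\\b" itself.
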

-- 
David.

> 
> Bert Gunter
> 
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>   -- Clifford Stoll
> 
> 
> On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>> 
>> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>> 
>>> I noticed the following:
>>> 
>>>> strsplit("red green","\\b")
>>> [[1]]
>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>> 
>> After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:
>> 
>> [[1]]
>> [1] "red"  " "  "green"
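
One way to get that result without strsplit(), offered as a sketch rather than a recommendation: extract alternating runs of word and non-word characters with gregexpr() and regmatches(). The pattern and the use of perl = TRUE are my own choices, not something taken from the help page:

```r
x <- "red green"
# Match maximal runs of word characters or of non-word characters.
pieces <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))
print(pieces)
```

Because the separators are matched instead of consumed, the space survives as its own piece.
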
>> 
>> There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"
>> 
>> I got the expected result with only one of "\\>" and "\\<":
>> 
>>> strsplit("red green","\\<")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>> 
>>> strsplit("red green","\\>")
>> [[1]]
>> [1] "red"    " green"
>> 
>> The result with "\\<" seems decidedly unexpected.
>> 
>> I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
>> =======
>> An assertion-character can be any of the following:
>> 
>>        • < – Beginning of word
>>        • > – End of word
>>        • b – Word boundary
>>        • B – Non-word boundary
>>        • d – Digit character (equivalent to [[:digit:]])
>>        • D – Non-digit character (equivalent to [^[:digit:]])
>>        • s – Space character (equivalent to [[:space:]])
>>        • S – Non-space character (equivalent to [^[:space:]])
>>        • w – Word character (equivalent to [[:alnum:]_])
>>        • W – Non-word character (equivalent to [^[:alnum:]_])
>> ========
>> 
>> The word "word" appears nowhere else on that page.
>> 
>> 
>>>> strsplit("red green","\\W")
>>> [[1]]
>>> [1] "red"   "green"
>> 
>> `\W` matches a single non-word character (one character wide, not zero-width), so the " " is consumed as the separator and discarded.
>> 
>>> 
>>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>> 
>>>> grep("\\bred\\b","red green")
>>> [1] 1
>>> ## as expected
>>> 
>>> Does strsplit use a different regex engine than grep()? Or more
>>> likely, what am I misunderstanding?
>>> 
>>> Thanks.
>>> 
>>> Bert
>>> 
>>> 
>> 
>> 
>> David Winsemius
>> Alameda, CA, USA
>> 

David Winsemius
Alameda, CA, USA