[R] Regular expressions: bug or misunderstanding?

Duncan Murdoch murdoch at stats.uwo.ca
Mon Jul 7 02:15:44 CEST 2008


On 06/07/2008 7:37 PM, Gabor Grothendieck wrote:
> Look at the discussion of zero width lookahead assertions in ?regex .
> Use perl = TRUE as previously indicated.

Thanks, this seems to work:

gsub( "(?<!E)((EE)*)q", "\\1Eq", x, perl=TRUE)
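
A quick way to check this, as a sketch only: the wrapper names escape_q and escape_quotes are made up for illustration, and the second function is an assumed translation back to the real characters (E becomes a backslash, q a double quote), not something tested in this thread.

```r
# escape_q is a hypothetical wrapper; the pattern is the one quoted above.
escape_q <- function(x) {
  gsub("(?<!E)((EE)*)q", "\\1Eq", x, perl = TRUE)
}

# The four test strings from the original question, plus an odd run of escapes.
print(escape_q(c("Eq", "EEq", "qaq", "qq", "EEEq")))

# Assumed translation to the real problem: E is a backslash, q a double quote.
# Each literal backslash is doubled once for the regex and once more for the
# R string, hence the runs of four and eight backslashes in the pattern.
escape_quotes <- function(x) {
  gsub('(?<!\\\\)((\\\\\\\\)*)"', '\\1\\\\"', x, perl = TRUE)
}

# writeLines shows the strings as they are, without R's own escaping.
writeLines(escape_quotes(c('\\"', '\\\\"', 'say "hi"')))
```

With this pattern "qq" should come back as "EqEq", and a quote preceded by an odd number of escapes should be left alone.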

Duncan Murdoch


> 
> On Sun, Jul 6, 2008 at 7:29 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
>> On 06/07/2008 5:37 PM, (Ted Harding) wrote:
>>> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
>>>> I'm trying to write a gsub() call that takes a string and escapes all the
>>>> unescaped quote marks in it.  So the string
>>>>
>>>> \"
>>>>
>>>> would be left unchanged, but
>>>>
>>>> \\"
>>>>
>>>> would be changed to
>>>>
>>>> \\\"
>>>>
>>>> because the double backslash doesn't act as an escape for the quote,
>>>> the first just escapes the second.  I have the usual problems of
>>>> writing regular expressions involving backslashes which make
>>>> everything I write completely unreadable, so I'm going to change
>>>> the problem for this post:  I will define E to be the escape
>>>> character, and q to be the quote; the gsub() call would leave
>>>>
>>>> Eq
>>>>
>>>> unchanged, but would change
>>>>
>>>> EEq
>>>>
>>>> to EEEq, etc.
>>>>
>>>> The expression I have come up with after this change is
>>>>
>>>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>>>
>>>> i.e. "(start of line, or non-escape, followed by an even number of
>>>> escapes), all of which we call expression 1, followed by a quote,
>>>> is replaced by expression 1 followed by an escape and a quote".
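
The description above can be mirrored by assembling the pattern from named pieces. This is only an illustrative sketch; the variable names are invented here.

```r
# Build the pattern piece by piece (names are illustrative only).
context   <- "(^|[^E])"  # start of string, or one character that is not E
esc_pairs <- "(EE)*"     # zero or more pairs of escapes, i.e. an even count
quote_chr <- "q"         # the quote to be escaped

# The outer parentheses make context + esc_pairs "expression 1" (\\1).
pattern <- paste0("(", context, esc_pairs, ")", quote_chr)
print(pattern)
```

Note that the [^E] branch consumes a real character, which matters when that character is itself a quote.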
>>>>
>>>> This works sometimes, but not always:
>>>>
>>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>>>> [1] "Eq"
>>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>>>> [1] "EEEq"
>>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>>>> [1] "EqaEq"
>>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>>>> [1] "qEq"
>>>>
>>>> Notice that in the final example, the first quote doesn't get escaped.
>>>> Why not????
>>> I think (without having done the "experimental diagnostics")
>>> that it's because in "qq" the first q matches (^|[^E]) because
>>> it matches [^E] (i.e. is a "non-escape"); since it is followed
>>> by q, it is the second q which gets the escape. Possibly you
>>> need to include "^q" as an additional alternative match at the
>>> start of the line.
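
One way to run the "experimental diagnostics" is to list the substrings the pattern matches and compare the count with the number of quotes. This is a sketch using the default regex engine; since matches cannot overlap, a quote that has been used up as the leading non-escape character is no longer available to be matched as a quote.

```r
pattern <- "((^|[^E])(EE)*)q"
x <- "qq"

# Number of quotes in the input.
n_quotes <- lengths(regmatches(x, gregexpr("q", x)))

# Substrings the pattern actually matches; matches never overlap.
hits <- regmatches(x, gregexpr(pattern, x))[[1]]

print(hits)
cat("quotes:", n_quotes, " matches:", length(hits), "\n")
```

Fewer matches than quotes means at least one quote was never seen as a quote by gsub().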
>> Thanks, that sounds right, but now I can't see how to fix it.  Is there
>> syntax to say:  match A only if it follows B, but don't match the B part?
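
With perl = TRUE, lookbehind assertions do this: they test what precedes the current position without making it part of the match. A minimal sketch, using A and B literally as in the question:

```r
x <- "BA AA CA"

# (?<=B)A : an A immediately preceded by B; the B is not part of the match.
after_b <- gsub("(?<=B)A", "x", x, perl = TRUE)
print(after_b)

# (?<!B)A : an A that is NOT immediately preceded by B.
not_after_b <- gsub("(?<!B)A", "x", x, perl = TRUE)
print(not_after_b)
```

The first call should give "Bx AA CA" and the second "BA xx Cx"; in both cases only the A is replaced.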
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>


