[R] Why do my regular expressions require a double escape \\ to get a literal??
Berend Hasselman
bhh at xs4all.nl
Fri Mar 2 15:31:17 CET 2012
On 02-03-2012, at 14:13, Roey Angel wrote:
> Hi Bernard, thanks for the quick reply.
> Of course, I understand that an escape is needed because parentheses are reserved symbols in regular expressions.
> My problem is that if I just use \( I get the error:
>
> Error: '\(' is an unrecognized escape in character string starting "\("
>
> so in order to get a literal ( I need to use \\(
> which is odd because I've never encountered that in any other language, and none of the R manuals mention it.
>
It is not odd, as the previous poster has already mentioned.
I have encountered this elsewhere (e.g. in awk).
You need the \\ because the expression between the quotes is interpreted twice:
first as a character string (in which \( is an illegal escape but \\ is legal and yields a single \), and then as a regular expression, in which the literal ( and ) you want to match must be escaped because they are metacharacters.
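
A minimal sketch of the two stages described above, in case it helps. The variable names (pat, x, res) are only for illustration and are not from the original data:

```r
# Stage 1: R's string parser turns each \\ in the source code into a single \.
pat <- "\\(.*\\)"

# The stored string has 6 characters: \ ( . * \ )
n_chars <- nchar(pat)

# cat() writes the string as the regex engine will receive it: \(.*\)
cat(pat, "\n")

# Stage 2: the regex engine reads \( and \) as literal parentheses.
x <- "Proteobacteria(94)"   # illustrative value in the style of the sample data
res <- gsub(pat, "", x)
print(res)
```

print() would show the pattern with the backslashes doubled again, because it displays the string in its escaped form; cat() shows the characters that are actually stored.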
If you don't like doing that (the \\), use this instead:
as.data.frame(apply(tax.data, 2, function(x) gsub('[(].*[)]','',x)))
i.e. put the ( and ) in a character class, where they lose their special meaning.
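
As a rough check that the two spellings behave the same, here is a small sketch on a cut-down copy of the sample data. The helper names (strip_class, strip_escape, strip_tight) are made up for this example, and the last pattern is my own variation, not something suggested earlier in the thread:

```r
# Cut-down copy of the sample data from the original question.
tax.data <- data.frame(
  V1 = c("G9SS7BA01D15EC", "G9SS7BA01C9UIR"),
  V2 = c("Bacteria(100)", "Bacteria(100)"),
  V3 = c("Cyanobacteria(84)", "Proteobacteria(94)"),
  stringsAsFactors = FALSE
)

# Character-class form: no backslashes needed.
strip_class <- function(x) gsub("[(].*[)]", "", x)
# Escaped form: each regex backslash is doubled in the R string.
strip_escape <- function(x) gsub("\\(.*\\)", "", x)

a <- as.data.frame(lapply(tax.data, strip_class), stringsAsFactors = FALSE)
b <- as.data.frame(lapply(tax.data, strip_escape), stringsAsFactors = FALSE)
same <- identical(a, b)
print(same)

# Variation (an assumption about what is wanted): .* is greedy, so if one
# field ever held two parenthesised groups, everything between the first (
# and the last ) would be removed. [^)]* stops at the first closing parenthesis.
strip_tight <- function(x) gsub("[(][^)]*[)]", "", x)
greedy <- strip_class("A(1) B(2)")
tight <- strip_tight("A(1) B(2)")
```

For the sample data, where each field holds at most one pair of parentheses, all three forms give the same result.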
Berend
>> On 02-03-2012, at 09:36, Roey Angel wrote:
>>
>>> Hi,
>>> I was recently unfortunate enough to have to use regular expressions to sort out some data in R.
>>> I'm working on a data file which contains taxonomical data of bacteria in hierarchical order.
>>> A sample of this file can be generated using:
>>>
>>> tax.data<- read.table(header=F, con<- textConnection('
>>> G9SS7BA01D15EC Bacteria(100) Cyanobacteria(84) unclassified
>>> G9SS7BA01C9UIR Bacteria(100) Proteobacteria(94) Alphaproteobacteria(89)
>>> G9SS7BA01CM00D Bacteria(100) Proteobacteria(99) Alphaproteobacteria(99)
>>> '))
>>> close(con)
>>>
>>> What I am trying to do is remove the parentheses and the number inside them (which could contain a decimal point).
>>> I assumed that the following command would solve it, but instead I got an error.
>>>
>>> tax.data<- as.data.frame(apply(tax.data, 2, function(x) gsub('\(.*\)','',x)))
>>> Error: '\(' is an unrecognized escape in character string starting "\("
>>>
>>> And it doesn't matter if I use perl = TRUE or not.
>>> To solve it I need to use a double escape sign '\\' before the opening and closing parentheses:
>>>
>>> tax.data<- as.data.frame(apply(tax.data, 2, function(x) gsub('\\(.*\\)','',x)))
>>>
>>> This yields the desired result but I wonder why it does that?
>>> No other regular expression system I'm used to (e.g. Perl, Shell) works like that.
>>>
>>> I'm using R 2.14 (but also R 2.10) and I get the same results on Ubuntu and Windows XP.
>>>
>>> I'd appreciate any explanation.
>> Section "Character vectors" in the R Intro manual.
>>
>> ?Quotes
>>
>> The regular expression is provided to gsub as a string, and strings have their own escape sequences.
>> To pass a single \ through to the regular expression parser, it has to be escaped with another \ at the string stage: \\
>>
>> Berend
>>
>>