[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences
Stefan Th. Gries
stgries_lists at arcor.de
Sun Jul 23 03:48:47 CEST 2006
Dear all
I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)
I would like to find words within the elements of a character vector that end in the same character sequence; when I find such a pair, I want to append <r> to both potentially rhyming sequences. An example:
INPUT: This is my dog.
DESIRED OUTPUT: This<r> is<r> my dog.
I found a solution for cases where the potentially rhyming words are adjacent:
text<-"This is my dog."
gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
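To see what the backreference actually captures in this adjacent case, the groups can be pulled out one by one. This is a minimal sketch, assuming a more recent R than 2.3.1 (regmatches() and the perl argument of regexec() were added in later releases); the variable names are only illustrative.

```r
text <- "This is my dog."
pattern <- "(\\w+?)(\\W\\w+?)\\1(\\W)"

# Full match followed by the three capture groups of the first match
groups <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
print(groups)

# The substitution itself, with the same pattern and replacement as above
tagged <- gsub(pattern, "\\1<r>\\2\\1<r>\\3", text, perl = TRUE)
print(tagged)
```

Group 1 should come out as just "s": the lazy \w+? settles for a single shared character, which already hints at the minimum-length question raised further down.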
However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.
(i) While I know what to do for non-adjacent words in general
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)
this runs into problems with overlapping matches:
text<-"And this is the second sentence"
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
[1] "And<r> this is the second<r> sentence"
It finds the "nd" pair, but because the "is" pair lies between the two "nd" sequences, inside the stretch of text already consumed by that match, it is not found. Any ideas on how to get all pairwise matches?
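One way around the overlap problem might be to stop asking a single gsub() call for all pairs and to compare word endings directly instead. The sketch below is only one possible approach, not an established solution: it splits the string into words, takes the last min_len characters of each, and tags every word whose ending also occurs on another word. tag_rhymes and min_len are hypothetical names, comparing endings of one fixed length is an assumption, and the replacement form of regmatches() requires a more recent R than 2.3.1.

```r
# Hypothetical helper: tag every word whose final min_len characters
# also end at least one other word in the same string.
tag_rhymes <- function(text, min_len = 2) {
  m <- gregexpr("\\w+", text, perl = TRUE)
  words <- regmatches(text, m)[[1]]
  n <- nchar(words)

  # Final min_len characters of each word; NA for words that are too short
  tails <- ifelse(n >= min_len,
                  substring(words, n - min_len + 1, n),
                  NA_character_)

  # A word is flagged when some other word carries the same ending
  flagged <- !is.na(tails) &
    (duplicated(tails) | duplicated(tails, fromLast = TRUE))

  # Write the (possibly tagged) words back into their original positions
  regmatches(text, m) <- list(paste0(words, ifelse(flagged, "<r>", "")))
  text
}

print(tag_rhymes("And this is the second sentence"))
```

With the default min_len = 2 this should tag both the "nd" pair and the "is" pair, and min_len = 1 would bring back single-character matches. Unlike the regular expression, the sketch also tags a final word that is not followed by a non-word character.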
(ii) How can one tell R to match only when the shared sequence is at least two characters long? If the above expression is applied to another character string
text<-"this is an example sentence."
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
it also matches the "e"s at the end of "example" and "sentence". It does not seem possible to get the result I want by specifying a range such as {2,}
text<-"this is an example sentence."
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
because, as I understand it, this requires the 2+ cases of \\w to be identical characters:
text<-"doo yoo see mee?"
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
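The understanding stated above can be checked directly. In this small sketch (variable names are illustrative), {2,} applied to \w appears to count word characters rather than demand a repeated character: the "nd" pair from the earlier sentence still matches. What seems to block the adjacent "is" pair is (\W.+?), which needs at least two characters between the two sequences. The last call relaxes that group to (\W.*?), which is an assumption about the intended behavior rather than a tested general solution.

```r
repl <- "\\1<r>\\2\\1<r>\\3"
pat_plus <- "(\\w{2,}?)(\\W.+?)\\1(\\W)"  # pattern from the post
pat_star <- "(\\w{2,}?)(\\W.*?)\\1(\\W)"  # variant: the gap may be one non-word character

# Two different characters ("nd") satisfy \w{2,}
res_nd <- gsub(pat_plus, repl, "And this is the second sentence", perl = TRUE)

# The single "e" is no longer matched, but neither is the adjacent "is" pair
res_plus <- gsub(pat_plus, repl, "this is an example sentence.", perl = TRUE)

# With .*? the adjacent pair is found
res_star <- gsub(pat_star, repl, "this is an example sentence.", perl = TRUE)

print(c(res_nd, res_plus, res_star))
```

This does nothing for the overlap problem in (i); it only separates the minimum-length question from the adjacency question.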
Again, any ideas?
I'd really appreciate any snippets of code, pointers, etc.
Thanks so much,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries