[R] Regular expressions: retrieving matches depending on intervening strings
Stefan Th. Gries
stgries_lists at arcor.de
Wed Aug 16 09:17:36 CEST 2006
Dear all,
I have another regular expression question. I have this character vector a:
a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" are
- directly adjacent, as in a[1] or
- not interrupted by "<[wc] ", as in a[2]
From these elements, I would like to extract all characters from the "<" in "<w CJC>" up to, but not including, the next "<" after "<w DT0>". For example, if I were searching only a[1], I would use something like this:
matches<-gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]], "match.length")-1)
I have been fiddling around with negative lookahead but I really can't get my head around this. Any pointers would be greatly appreciated. Thanks a lot,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
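
One way the negative lookahead mentioned above might be applied is sketched below. It is not a definitive answer: it assumes that "not interrupted" means no "<w " or "<c " tag may start anywhere between "<w CJC>" and "<w DT0>", and it uses regmatches(), which is available only in later versions of R. The names pattern, hits, found, matched_idx and result are made up for this illustration.

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# After "<w CJC>", consume one character at a time, but only while the
# current position does not start a "<w " or "<c " tag (negative lookahead).
# The scan must then arrive at "<w DT0>", followed by everything up to the next "<".
pattern <- "<w CJC>(?:(?!<[wc] ).)*<w DT0>[^<]+"

hits <- gregexpr(pattern, a, perl = TRUE)
found <- regmatches(a, hits)          # one character vector per element of a

# Indices of the elements of a that contain at least one match
matched_idx <- which(sapply(found, length) > 0)

# The matched substrings themselves
result <- unlist(found)
print(matched_idx)
print(result)
```

Under that assumption, a[3] and a[4] are skipped because a "<c " or "<w " tag starts before "<w DT0>" is reached, while the "<ptr ...>" tag in a[2] is allowed through.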