[R] Regex Split?

Ivan Krylov
Fri May 5 12:24:36 CEST 2023


On Thu, 4 May 2023 23:59:33 +0300
Leonard Mada via R-help <r-help using r-project.org> wrote:

> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> 
> 
> Is this correct?

Perl seems to return the results you expect:

$ perl -E '
 say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
 for (
  qr[ |(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
  qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
)'
(?^u: |(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
 "a" "bc" "," "def" "," "adef" "," "," "gh"

The same thing happens when I ask R to replace the separators instead
of splitting by them:

sapply(setNames(nm = c(
 " |(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?![ ])",
 " |(?<! )(?=,)|(?<=,)(?=[^ ])")
), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
#               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ]) 
# "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh" 
#        |(?<! )(?=,)|(?<=,)(?=[^ ]) 
# "a[]bc[],[]def[],[]adef[],[],[]gh" 

I think that something strange happens when the delimiter pattern
matches more than once in the same place:

gsub(
 '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
 perl = TRUE
)
# [1] "split here -->[]<-- split here"

(Both Perl's split() and s///g agree with R's gsub() here, although I
would have accepted "split here -->[][]<-- split here" too.)

On the other hand, the following doesn't look right:

strsplit(
 'split here --><-- split here', '(?=<--)|(?<=-->)',
 perl = TRUE
)
# [[1]]
# [1] "split here -->" "<"              "-- split here"

The "<" is definitely not followed by "<--", and the rightmost "--" is
definitely not preceded by "-->".

Perhaps strsplit() incorrectly advances the match position after one
match?
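
One way to test this guess is to emulate the loop that ?strsplit
describes (find a match, emit what is to the left of it, drop the match
and everything before it, search again in what remains) and see whether
it gives the same tokens. The sketch below is only that: the helper name
split_by_truncation() is made up, and the rule that an empty match at
the very start of the remaining string consumes exactly one character is
an assumption of mine, not something taken from the R sources.

```r
# Hypothetical helper, not part of R: emulate "match, emit the left part,
# truncate, search again". Because every search runs on the truncated
# remainder, a lookbehind cannot see text that has already been consumed.
split_by_truncation <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m == -1L) {
      # No separator left: the remainder is the last token.
      out <- c(out, x)
      break
    }
    start <- as.integer(m)
    end <- start + attr(m, "match.length") - 1L  # start - 1 for an empty match
    if (end >= 1L) {
      # The match ends past the start of the remainder: emit what is to
      # its left (possibly "") and drop everything up to the match end.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at position 1 (assumed rule): emit one character as a
      # token and step over it.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_by_truncation("split here --><-- split here", "(?=<--)|(?<=-->)")
split_by_truncation("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])")
```

If this model is close to what strsplit() does, it would account for
both the stray "<" token here and the empty string before "adef" in
your examples: once the text to the left of a match is thrown away, the
lookbehind has nothing to look at, and an empty match at the start of
the remainder becomes a one-character token.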

-- 
Best regards,
Ivan


