[R] Regex Split?
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Fri May 5 12:24:36 CEST 2023
On Thu, 4 May 2023 23:59:33 +0300
Leonard Mada via R-help <r-help using r-project.org> wrote:
> strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
> strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> perl=T)
> # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
>
>
> Is this correct?
Perl seems to return the results you expect:
$ perl -E '
say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
for (
qr[ |(?=,)|(?<=,)(?![ ])],
qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
)'
(?^u: |(?=,)|(?<=,)(?![ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"
(?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
"a" "bc" "," "def" "," "adef" "," "," "gh"
The same thing happens when I ask R to replace the separators instead
of splitting by them:
sapply(setNames(nm = c(
" |(?=,)|(?<=,)(?![ ])",
" |(?<! )(?=,)|(?<=,)(?![ ])",
" |(?<! )(?=,)|(?<=,)(?=[^ ])")
), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
# |(?=,)|(?<=,)(?![ ]) |(?<! )(?=,)|(?<=,)(?![ ])
# "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
# |(?<! )(?=,)|(?<=,)(?=[^ ])
# "a[]bc[],[]def[],[]adef[],[],[]gh"
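Since gsub() puts the markers where Perl would split, one possible workaround (a sketch, not a fix for strsplit() itself) is to let gsub() insert a marker and then split on that marker as a fixed string. The marker "\001" and the variable names are my own choices; the sketch assumes the marker never occurs in the input.

```r
x <- "a bc,def, adef ,,gh"
pattern <- " |(?=,)|(?<=,)(?![ ])"

# Assumption: this control character does not appear in the input text.
sep <- "\001"
stopifnot(!grepl(sep, x, fixed = TRUE))

# Step 1: let gsub() replace every separator match with the marker.
marked <- gsub(pattern, sep, x, perl = TRUE)

# Step 2: split on the marker literally, so no regex matching is involved.
pieces <- strsplit(marked, sep, fixed = TRUE)[[1]]
pieces
# "a" "bc" "," "def" "," "adef" "," "," "gh"
```

This gives the same pieces as the Perl output above, without the extra "" element.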
I think that something strange happens when the delimiter pattern
matches more than once in the same place:
gsub(
'(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
perl = TRUE
)
# [1] "split here -->[]<-- split here"
(Both Perl's split() and s///g agree with R's gsub() here, although I
would have accepted "split here -->[][]<-- split here" too.)
On the other hand, the following doesn't look right:
strsplit(
'split here --><-- split here', '(?=<--)|(?<=-->)',
perl = TRUE
)
# [[1]]
# [1] "split here -->" "<" "-- split here"
The "<" is definitely not followed by "<--", and the rightmost "--" is
definitely not preceded by "-->".
Perhaps strsplit() incorrectly advances the match position after one
match?
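To make that guess concrete, here is a rough emulation in plain R. It assumes that the splitting loop restarts the search on the remaining text after every match, and that an empty match at the very start of the remaining text yields a one-character token. This is my assumption about the behaviour (it seems consistent with the algorithm described in ?strsplit), not a reading of the actual source, and split_by_restart() is a made-up name.

```r
# Hypothetical helper: split by restarting the regex search on the
# remaining text after every match (an assumed model of strsplit()).
split_by_restart <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m == -1L) {
      # No separator left: the rest is the final token.
      out <- c(out, x)
      break
    }
    end <- m + attr(m, "match.length") - 1L  # last character of the match
    if (end > 0L) {
      # The match ends past the start: emit the text before it, then
      # drop everything up to and including the match.
      out <- c(out, substr(x, 1L, m - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character and move on.
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_by_restart("split here --><-- split here", "(?=<--)|(?<=-->)")
# "split here -->" "<" "-- split here"
```

Under these assumptions, the lookahead matches a second time at the start of the remaining "<-- split here" (the lookbehind can no longer see the "-->" that was cut off), which would explain the stray "<". The same emulation also yields the extra "" for the comma example.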
--
Best regards,
Ivan