[R] strsplit("dia ma", "\\b") splits characterwise
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Jul 8 15:33:33 CEST 2010
On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono
<suharto_anggono at yahoo.com> wrote:
> \b is word boundary.
> But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
>
>> strsplit("dia ma", "\\b")
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>> strsplit("dia ma", "\\b", perl=TRUE)
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>
> How can that be?
>
> This is the output of 'gregexpr'.
>
>> gregexpr("\\b", "dia ma")
> [[1]]
> [1] 1 2 3 4 5 6
> attr(,"match.length")
> [1] 0 0 0 0 0 0
>
>> gregexpr("\\b", "dia ma", perl=TRUE)
> [[1]]
> [1] 1 4 5 7
> attr(,"match.length")
> [1] 0 0 0 0
>
>
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I expect 'strsplit' to split at those points.
You can use strapply in the gsubfn package to match all words and non-words:
library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c) # c("dia", " ", "ma")
or all spaces and non-spaces:
strapply("dia ma", "\\s+|\\S+", c) # c("dia", " ", "ma")
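
If you would rather not load a package, the same "match the pieces instead of splitting" idea can probably be done in base R as well. This is only a sketch and assumes an R version that provides regmatches() (it was added in R 2.14.0, I believe); the variable names are just for illustration. As far as I can tell from ?strsplit, empty (zero-length) matches make strsplit fall back to splitting into single characters, which would explain the output quoted above, so matching the tokens directly avoids the problem:

```r
# Base-R sketch: extract the pieces rather than splitting on a zero-length match.
# Assumes regmatches() is available (R >= 2.14.0).
x <- "dia ma"

# Runs of word characters, or runs of non-word characters
words <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))[[1]]

# Runs of whitespace, or runs of non-whitespace
spaces <- regmatches(x, gregexpr("\\s+|\\S+", x, perl = TRUE))[[1]]

print(words)
print(spaces)
```

For this input both calls should give the three pieces "dia", " " and "ma", the same pieces as in the strapply calls above.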