[R] interval between specific characters in a string...
Hadley Wickham
h@w|ckh@m @end|ng |rom gm@||@com
Sun Dec 4 09:25:08 CET 2022
On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.github using gmail.com> wrote:
>
> On 03/12/2022 07:21, Bert Gunter wrote:
> > Perhaps it is worth pointing out that looping constructs like lapply() can
> > be avoided and the procedure vectorized by mimicking Martin Morgan's
> > solution:
> >
> > ## s is the string to be searched.
> > diff(c(0,grep('b',strsplit(s,'')[[1]])))
> >
> > However, Martin's solution is simpler and likely even faster as the regex
> > engine is unneeded:
> >
> > diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> >
> > This seems much preferable to me.
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
> big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
> system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
> # user system elapsed
> # 0.736 0.028 0.764
>
> system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]]
> == "b"))))
> # user system elapsed
> # 2.100 0.356 2.455
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.

You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#> user system elapsed
#> 0.126 0.002 0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#> user system elapsed
#> 0.103 0.004 0.107

(And my timings also suggest that it's time for Hervé to get a new computer :P)
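
For anyone checking the results rather than the timings, here is a small
sanity check on a short string. It is only a sketch: s, gaps_base, pieces
and gaps_stringr are names made up for illustration, and the stringr part
runs only if the package is installed. One difference to be aware of:
str_split() keeps the trailing empty piece that strsplit() drops, so the
last piece has to be discarded before the two results can be compared.

```r
# Short test string; the "b"s sit at positions 2, 6, 7 and 13.
s <- "abaaabbaaaaab"

# Base R: length of each piece between "b"s, plus one for the "b" itself.
gaps_base <- nchar(strsplit(s, split = "b", fixed = TRUE)[[1]]) + 1
gaps_base
#> [1] 2 4 1 6

if (requireNamespace("stringr", quietly = TRUE)) {
  pieces <- stringr::str_split(s, stringr::fixed("b"))[[1]]
  # str_split() returns one more piece than there are "b"s: the text after
  # the last "b" (here an empty string). Drop it before measuring.
  gaps_stringr <- stringr::str_length(head(pieces, -1)) + 1
  print(identical(as.integer(gaps_base), as.integer(gaps_stringr)))
}
```

Dropping the last piece should also cover a string that does not end in
"b", where the strsplit() version would report an extra final value that
is not a real gap.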

It feels like an approach that uses locations should be faster, since
you wouldn't have to construct all the intermediate strings.

system.time(pos <- str_locate_all(big_string, fixed("b"))[[1]][,1])
#> user system elapsed
#> 0.075 0.004 0.080

# I suspect this could be optimised with a little thought, making this
# approach faster overall
system.time(diff(c(0, pos)))
#> user system elapsed
#> 0.022 0.006 0.027

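
The same idea on a short string, as a sketch. To keep it free of
dependencies it uses base R's gregexpr() with fixed = TRUE in place of
str_locate_all(); that swap is only for illustration and has not been
timed here. s, pos and gaps are illustrative names.

```r
s <- "abaaabbaaaaab"

# Start position of every literal "b"; as.integer() drops the match
# attributes. Note that gregexpr() returns -1 when there is no match at all.
pos <- as.integer(gregexpr("b", s, fixed = TRUE)[[1]])
pos
#> [1]  2  6  7 13

# Prepend 0 so the first gap is measured from the start of the string.
gaps <- diff(c(0, pos))
gaps
#> [1] 2 4 1 6
```

The 0 has to go inside diff(): c(0, diff(pos)) would return 0 4 1 6 here
and lose the distance to the first "b".
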
Hadley
--
http://hadley.nz