[R] interval between specific characters in a string...

Sun Dec 4 03:33:05 CET 2022

Thanks. Very informative.
I certainly missed this.

-- Bert

On Sat, Dec 3, 2022 at 3:49 PM Hervé Pagès <hpages.on.github using gmail.com>
wrote:

> On 03/12/2022 07:21, Bert Gunter wrote:
> > Perhaps it is worth pointing out that looping constructs like lapply()
> > can be avoided and the procedure vectorized by mimicking Martin Morgan's
> > solution:
> >
> > ## s is the string to be searched.
> > diff(c(0,grep('b',strsplit(s,'')[[1]])))
> >
> > However, Martin's solution is simpler and likely even faster as the regex
> > engine is unneeded:
> >
> > diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
> >
> > This seems much preferable to me.
>
> Of all the proposed solutions, Andrew Hart's solution seems the most
> efficient:
>
>    big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
>
>    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
>    #    user  system elapsed
>    #   0.736   0.028   0.764
>
>    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
>    #    user  system elapsed
>    #  2.100   0.356   2.455
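
A minimal sketch, assuming base R only, to check that the two timed expressions return the same intervals on the example string from the original question. The helper names intervals_which, intervals_split and intervals_split_safe are hypothetical and appear nowhere in the thread. One caveat the sketch makes visible: the split-based version counts any characters after the last "b" as one more interval, so the two agree only when the string ends in the target character.

```r
# Hypothetical helpers wrapping the two expressions timed above (base R only).
intervals_which <- function(s, ch = "b") {
  diff(c(0L, which(strsplit(s, "", fixed = TRUE)[[1]] == ch)))
}
intervals_split <- function(s, ch = "b") {
  nchar(strsplit(s, split = ch, fixed = TRUE)[[1]]) + 1L
}
# Assumed variant: drop the trailing piece when s does not end in ch.
intervals_split_safe <- function(s, ch = "b") {
  out <- intervals_split(s, ch)
  if (!endsWith(s, ch)) out <- head(out, -1L)
  out
}

s <- "abaaabbaaaaabaaab"
intervals_which(s)            # 2 4 1 6 4
intervals_split(s)            # 2 4 1 6 4

# String that does not end in "b": the plain split version adds an element.
intervals_which("abaa")       # 2
intervals_split("abaa")       # 2 3
intervals_split_safe("abaa")  # 2
```
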
>
> The bigger the string, the bigger the gap in performance.
>
> Also, the bigger the average gap between 2 successive b's, the bigger
> the gap in performance.
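
One way to probe the scaling claim is to repeat the comparison at several string sizes. This is only a sketch: the repetition counts below are arbitrary assumptions, much smaller than the 500000 used above, and timings will vary with machine and R version, so no expected numbers are shown.

```r
# Repeat the two approaches at a few (assumed, illustrative) sizes.
pattern <- "abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab"
for (n in c(1000, 10000, 100000)) {
  big_string <- strrep(pattern, n)
  t_split <- system.time(
    r1 <- nchar(strsplit(big_string, split = "b", fixed = TRUE)[[1]]) + 1L
  )[["elapsed"]]
  t_which <- system.time(
    r2 <- diff(c(0L, which(strsplit(big_string, "", fixed = TRUE)[[1]] == "b")))
  )[["elapsed"]]
  # Both approaches should agree, since the pattern ends in "b".
  stopifnot(identical(r1, r2))
  cat(sprintf("n = %6d  split: %.3fs  which: %.3fs\n", n, t_split, t_which))
}
```
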
>
> Finally: always use fixed=TRUE in strsplit() if you don't need to use
> the regex engine.
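
Beyond speed, fixed=TRUE also matters for correctness when the target character is a regex metacharacter. A small sketch, assuming we want the intervals between "." characters instead of "b" (the example string is the original one with "b" replaced by "."):

```r
s <- "a.aaa..aaaaa.aaa."

# Default: 'split' is a regular expression, and "." matches any character,
# so every character is consumed as a separator and only empty strings remain.
regex_pieces <- strsplit(s, ".")[[1]]
all(regex_pieces == "")   # TRUE

# fixed = TRUE treats "." as a literal character.
dot_intervals <- nchar(strsplit(s, ".", fixed = TRUE)[[1]]) + 1L
dot_intervals             # 2 4 1 6 4
```
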
>
> Cheers,
>
> H.
>
>
> > -- Bert
> >
> >
> >
> >
> >
> > On Sat, Dec 3, 2022 at 12:49 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >
> >> At 17:18 on 02/12/2022, Evan Cooch wrote:
> >>> Was wondering if there is an 'efficient/elegant' way to do the
> >>> following (without tidyverse). Take a string
> >>>
> >>> abaaabbaaaaabaaab
> >>>
> >>> It's easy enough to count the number of times the character 'b' shows up
> >>> in the string, but...what I'm looking for is outputting the 'intervals'
> >>> between occurrences of 'b' (starting the counter at the beginning of
> >>> the string). So, for the preceding example, 'b' shows up in positions
> >>>
> >>> 2, 6, 7, 13, 17
> >>>
> >>> So, the interval data would be: 2, 4, 1, 6, 4
> >>>
> >>> My main approach has been to simply output positions (say, something
> >>> like unlist(gregexpr('b', target_string))), and 'do the math' between
> >>> successive positions. Can anyone suggest a more elegant approach?
> >>>
> >>> Thanks in advance...
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >> Hello,
> >>
> >> I don't find your solution inelegant; it's even easy to write as a
> >> one-line function.
> >>
> >>
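> >> # For each string in s: position of the first match of x, then the
> >> # gaps between successive matches (gregexpr() gives -1 if no match).
> >> char_interval <- function(x, s) {
> >>     lapply(gregexpr(x, s), \(y) c(head(y, 1), diff(y)))
> >> }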
> >>
> >> target_string <-"abaaabbaaaaabaaab"
> >> char_interval('b', target_string)
> >> #> [[1]]
> >> #> [1] 2 4 1 6 4
> >>
> >>
> >> Hope this helps,
> >>
> >> Rui Barradas
> >>
> >>
> >
>
> --
> Hervé Pagès
>
> Bioconductor Core Team
> hpages.on.github using gmail.com
>
>
