Sun Dec 4 01:21:48 CET 2022
This may be a fairly dumb and often-asked question about functions like strsplit() that return a list of things, often a list of ONE thing that may itself be another list or a vector and needs to be made into something simpler.
The examples shown below use various methods to convert the result to a vector, but why is there no built-in option for such a function to simplify the result, either when possible or always?
Sure, you can subset it with [[1]] or use unlist() to get back to a vector. But when this is such a common idiom, and many people waste lots of time debugging before they figure out they had a LIST containing a single vector, maybe it would have made sense to have either a sister function like strsplit_v() that returns what is actually wanted, or to allow strsplit(whatever, output="vector") or something giving the same result.
Yes, I understand that when there is a workaround, it just complicates the base, but there could be a package that consistently does things like this to make the use of such functions easier.
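To make the idea concrete, here is a minimal sketch of what such a wrapper might look like. The name strsplit_v() is hypothetical (it is only the name floated above, not an existing base R or package function), and the restriction to a single input string is my assumption about what "what is actually wanted" means.

```r
# Hypothetical helper, NOT part of base R: split ONE string and return a
# plain character vector instead of a list of length one.
strsplit_v <- function(x, split, ...) {
  # strsplit() is vectorized over x and returns one list element per input
  # string, so insist on a single string before dropping the list wrapper.
  stopifnot(is.character(x), length(x) == 1L)
  strsplit(x, split, ...)[[1L]]
}

parts <- strsplit_v("a,b,c", ",", fixed = TRUE)
print(parts)
```

The length check is the design choice here: with more than one input string there is no single vector to return, which is presumably why strsplit() hands back a list in the first place.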
On 03/12/2022 07:21, Bert Gunter wrote:
> Perhaps it is worth pointing out that looping constructs like lapply()
> can be avoided and the procedure vectorized by mimicking Martin
> Morgan's solution:
> ## s is the string to be searched.
> diff(c(0,grep('b',strsplit(s,'')[[1]])))
>
> However, Martin's solution is simpler and likely even faster as the
> regex engine is unneeded:
>
> diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized
>
> This seems much preferable to me.
Of all the proposed solutions, Andrew Hart's solution seems the most
efficient:
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)
system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
# user system elapsed
# 0.736 0.028 0.764
system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
# user system elapsed
# 2.100 0.356 2.455
The bigger the string, the bigger the gap in performance.
Also, the bigger the average gap between 2 successive b's, the bigger the gap in performance.
Finally: always use fixed=TRUE in strsplit() if you don't need to use the regex engine.
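Besides speed, fixed=TRUE also avoids surprises when the split string contains a regex metacharacter. A small illustration (the input string and variable names are just examples):

```r
x <- "a.b.c"

# Without fixed=TRUE the split is a regular expression, and "." matches
# every character, so nothing but empty strings is left over.
regex_split <- strsplit(x, ".")[[1]]

# With fixed=TRUE the "." is taken literally.
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]

print(regex_split)
print(fixed_split)
```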
Cheers,
H.
> On Sat, Dec 3, 2022 at 12:49 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
>> At 17:18 on 02/12/2022, Evan Cooch wrote:
>>> Was wondering if there is an 'efficient/elegant' way to do the
>>> following (without tidyverse). Take a string
>>>
>>> abaaabbaaaaabaaab
>>>
>>> It's easy enough to count the number of times the character 'b' shows
>>> up in the string, but... what I'm looking for is outputting the 'intervals'
>>> between occurrences of 'b' (starting the counter at the beginning of
>>> the string). So, for the preceding example, 'b' shows up in positions
>>>
>>> 2, 6, 7, 13, 17
>>>
>>> So, the interval data would be: 2, 4, 1, 6, 4
>>>
>>> My main approach has been to simply output positions (say, something
>>> like unlist(gregexpr('b', target_string))), and 'do the math'
>>> between successive positions. Can anyone suggest a more elegant approach?
>>>
>>> Thanks in advance...
>>>
>> Hello,
>>
>> I don't find your solution inelegant, it's even easy to write it as a
>> one-line function.
>>
>>
>> char_interval <- function(x, s) {
>>   lapply(gregexpr(x, s), \(y) c(head(y, 1), diff(y)))
>> }
>>
>> target_string <- "abaaabbaaaaabaaab"
>> char_interval('b', target_string)
>> #> [[1]]
>> #> [1] 2 4 1 6 4
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
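As a quick sanity check, the three approaches from this thread can be run side by side on the original example. This is only a sketch: the variable names are mine, and I have added fixed=TRUE to every call. It also shows one caveat worth keeping in mind: the split-on-"b" version counts the trailing characters as an extra interval when the string does not end in "b".

```r
s <- "abaaabbaaaaabaaab"

# Positions via gregexpr(), then differences between successive positions.
by_gregexpr <- diff(c(0, gregexpr("b", s, fixed = TRUE)[[1]]))

# Split into single characters and locate the b's with which().
by_which <- diff(c(0, which(strsplit(s, "", fixed = TRUE)[[1]] == "b")))

# Split on "b" and measure the pieces; each piece is one interval minus the b.
by_split <- nchar(strsplit(s, "b", fixed = TRUE)[[1]]) + 1

print(by_gregexpr)
print(by_which)
print(by_split)

# Caveat: here the string does NOT end in "b", so the trailing "aa" shows up
# as a second piece in the split-on-"b" version.
s2 <- "abaa"
tail_which <- diff(c(0, which(strsplit(s2, "", fixed = TRUE)[[1]] == "b")))
tail_split <- nchar(strsplit(s2, "b", fixed = TRUE)[[1]]) + 1
print(tail_which)
print(tail_split)
```

If strings that do not end in the target character are possible, the split version needs its last element dropped in that case, or one of the position-based versions can be used instead.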
