[R] vectorized sub, gsub, grep, etc.
Adam Erickson
adam.michael.erickson at gmail.com
Wed Jul 29 02:12:14 CEST 2015
Hi John,
The version I wrote performs vectorized full-string matching and
replacement, with some error checking and flexible inputs. I think there are
good reasons to use this method where possible (e.g., speed and
reduced complexity). Duly noted that it solves a different problem from the
original question, which I only skimmed. Both of the earlier versions you
listed are actually faster than the stringr equivalent:
str_replace(X,patt,repl)
[1] "aB" "CD" "ef"
system.time(for(i in 1:50000) str_replace(X,patt,repl))
user system elapsed
5.51 0.00 5.79
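For comparison, full-string matching (as opposed to the substring
replacement that sub() and str_replace() perform) can be sketched in base R
with match(). This is only an illustration under stated assumptions:
replace_exact is a hypothetical name, not a function from this thread;
elements that match no pattern are assumed to pass through unchanged; and if
a pattern is duplicated, the first occurrence wins.

```r
# Hypothetical helper (not from the thread): replace elements of x that are
# exactly equal to one of the patterns; leave everything else untouched.
replace_exact <- function(pattern, replacement, x) {
  stopifnot(length(pattern) == length(replacement))
  idx <- match(x, pattern)          # index of each x in pattern, NA if absent
  hit <- !is.na(idx)                # TRUE where x equals some pattern
  x[hit] <- replacement[idx[hit]]   # swap in the paired replacement
  x
}

X    <- c("ab", "cd", "ef")
patt <- c("b", "cd", "a")
repl <- c("B", "CD", "A")
res_exact <- replace_exact(patt, repl, X)
print(res_exact)
# [1] "ab" "CD" "ef"
```

Only "cd" equals a whole pattern, so it is the only element that changes.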
However, it seems unrealistic that the vectors would be perfectly ordered
in this way for most applications. The previously listed code is faster than
other approaches because there are far fewer permutations and only the
first character is checked. Perhaps that was the intention? I find this
case to be rare. For data.tables, I prefer the := operator together with
like(), which uses grepl().
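A minimal base R sketch of the order-independent case, assuming every
pattern should be tried against every element rather than only the element
at the same position. sub_any is a hypothetical name, not a function from
this thread. The pairs are applied in sequence, so a later pattern can also
match text produced by an earlier replacement.

```r
# Hypothetical helper (not from the thread): apply each pattern/replacement
# pair to the whole vector, so the order of x does not need to line up with
# the order of the patterns.
sub_any <- function(pattern, replacement, x) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    # sub() is vectorized over x; fixed = TRUE treats the pattern literally
    x <- sub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

X    <- c("ab", "cd", "ef")
patt <- c("b", "cd", "a")
repl <- c("B", "CD", "A")
res_any <- sub_any(patt, repl, X)
print(res_any)
# [1] "AB" "CD" "ef"
```

Unlike the position-wise methods, the first element becomes "AB" here,
because the pattern "a" is also tried against it.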
Cheers,
Adam
On Tue, Jul 28, 2015 at 3:00 PM, John Thaden <jjthaden at flash.net> wrote:
> Adam,
> The method you propose gives a different result than the prior methods
> for these example vectors
>
> X <- c("ab", "cd", "ef")
> patt <- c("b", "cd", "a")
> repl <- c("B", "CD", "A")
>
> Old method 1
>
> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
>
> gives
>
>
> b cd a
> "aB" "CD" "ef"
>
> Old method 2
>
> sub2 <- function(pattern, replacement, x) {
> len <- length(x)
> if (length(pattern) == 1)
> pattern <- rep(pattern, len)
> if (length(replacement) == 1)
> replacement <- rep(replacement, len)
> FUN <- function(i, ...) {
> sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> }
> idx <- 1:length(x)
> sapply(idx, FUN)
> }
> sub2(patt, repl, X)
>
> gives
>
> [1] "aB" "CD" "ef"
>
> Your method (I gave it the unique name "sub3")
>
> sub3 <- function(pattern, replacement, x) {
> len <- length(x)
> y <- character(length=len)
> patlen <- length(pattern)
> replen <- length(replacement)
> if(patlen != replen) stop('Error: Pattern and replacement length do not match')
> for(i in 1:replen) {
> y[which(x==pattern[i])] <- replacement[i]
> }
> return(y)
> }
> sub3(patt, repl, X)
>
> gives
>
> [1] "" "CD" ""
>
> Granted, whatever it does, it does it faster
>
> #Old method 1
> system.time(for(i in 1:50000)
> mapply(function(p,r,x) sub(p,r,x, fixed = TRUE),p=patt,r=repl,x=X))
>
> user system elapsed
> 2.53 0.00 2.52
>
> #Old method 2
> system.time(for(i in 1:50000)
> sub2(patt, repl, X))
>
> user system elapsed
> 2.32 0.00 2.32
>
> #Your proposed method
> system.time(for(i in 1:50000) sub3(patt, repl, X))
>
> user system elapsed
> 1.02 0.00 1.01
>
> but would it still be faster if it actually solved the same problem?
>
> -John Thaden
>
>
>
>
> On Monday, July 27, 2015 11:40 PM, Adam Erickson <
> adam.michael.erickson at gmail.com> wrote:
>
> I know this is an old thread, but I wrote a simple FOR loop with
> vectorized pattern replacement that is much faster than either of those (it
> can also accept outputs differing in length from the patterns):
>
> sub2 <- function(pattern, replacement, x) {
> len <- length(x)
> y <- character(length=len)
> patlen <- length(pattern)
> replen <- length(replacement)
> if(patlen != replen) stop('Error: Pattern and replacement length do not match')
> for(i in 1:replen) {
> y[which(x==pattern[i])] <- replacement[i]
> }
> return(y)
> }
>
> system.time(test <- sub2(patt, repl, XX))
> user system elapsed
> 0 0 0
>
> Cheers,
>
> Adam
>
> On Wednesday, October 8, 2008 at 9:38:01 PM UTC-7, john wrote:
>
> Hello Christos,
> To my surprise, vectorization actually hurt processing speed!
> #Example
> X <- c("ab", "cd", "ef")
> patt <- c("b", "cd", "a")
> repl <- c("B", "CD", "A")
> sub2 <- function(pattern, replacement, x) {
> len <- length(x)
> if (length(pattern) == 1)
> pattern <- rep(pattern, len)
> if (length(replacement) == 1)
> replacement <- rep(replacement, len)
> FUN <- function(i, ...) {
> sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> }
> idx <- 1:length(x)
> sapply(idx, FUN)
> }
>
> system.time( for(i in 1:10000) sub2(patt, repl, X) )
> user system elapsed
> 1.18 0.07 1.26
> system.time( for(i in 1:10000) mapply(function(p, r, x) sub(p, r, x,
> fixed = TRUE), p=patt, r=repl, x=X) )
> user system elapsed
> 1.42 0.05 1.47
>
> So much for avoiding loops.
> John Thaden
> ======= At 2008-10-07, 14:58:10 Christos wrote: =======
> >John,
> >Try the following:
> >
> > mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
> > b cd a
> >"aB" "CD" "ef"
> >
> >-Christos
> >> -----My Original Message-----
> >> R pattern-matching and replacement functions are
> >> vectorized: they can operate on vectors of targets.
> >> However, they can only use one pattern and replacement.
> >> Here is code to apply a different pattern and replacement for
> >> every target. My question: can it be done better?
> >>
> >> sub2 <- function(pattern, replacement, x) {
> >> len <- length(x)
> >> if (length(pattern) == 1)
> >> pattern <- rep(pattern, len)
> >> if (length(replacement) == 1)
> >> replacement <- rep(replacement, len)
> >> FUN <- function(i, ...) {
> >> sub(pattern[i], replacement[i], x[i], fixed = TRUE)
> >> }
> >> idx <- 1:length(x)
> >> sapply(idx, FUN)
> >> }
> >>
> >> #Example
> >> X <- c("ab", "cd", "ef")
> >> patt <- c("b", "cd", "a")
> >> repl <- c("B", "CD", "A")
> >> sub2(patt, repl, X)
> >>
> >> -John
> ______________________________________________
> R-h... at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>