[R] vectorized sub, gsub, grep, etc.

Adam Erickson adam.michael.erickson at gmail.com
Wed Jul 29 23:42:23 CEST 2015


Further refining the vectorized (within a loop) exact string match 
function, I get times below 0.9 seconds while maintaining error checking. 
This is accomplished by removing which() and replacing 1:length() with 
seq_along().

sub2 <- function(pattern, replacement, x) {
   len    <- length(x)
   y      <- character(length=len)
   patlen <- length(pattern)
   replen <- length(replacement)
   if(patlen != replen) stop('Error: Pattern and replacement length do not 
match')
   for(i in seq_along(pattern)) {
     y[x==pattern[i]] <- replacement[i]
   }
   return(y)
 }

system.time(for(i in 1:50000) sub2(patt, repl, X))
   user  system elapsed 
   0.86    0.00    0.86 

Since the ordered vectors are perfectly aligned, might as well do an exact 
string match. Hence, I think this is not off-topic.

Cheers,

Adam

On Wednesday, July 29, 2015 at 8:15:52 AM UTC-7, Bert Gunter wrote:
>
> There is confusion here. apply() family functions are **NOT** 
> vectorization -- they ARE loops (at the interpreter level), just done 
> in "functionalized" form. Please read background material (John 
> Chambers's books, MASS, or numerous others) to improve your 
> understanding and avoid posting erroneous comments. 
>
> Cheers, 
> Bert 
>
>
> Bert Gunter 
>
> "Data is not information. Information is not knowledge. And knowledge 
> is certainly not wisdom." 
>    -- Clifford Stoll 
>
>
> On Tue, Jul 28, 2015 at 3:00 PM, John Thaden <jjth... at flash.net 
> <javascript:>> wrote: 
> > Adam,    The method you propose gives a different result than the prior 
> methods for these example vectors 
> > X <- c("ab", "cd", "ef") 
> > patt <- c("b", "cd", "a") 
> > repl <- c("B", "CD", "A") 
> > 
> > Old method 1 
> > 
> > mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, 
> x=X) 
> > gives 
> >   b   cd    a 
> > "aB" "CD" "ef" 
> > 
> > Old method 2 
> > 
> > sub2 <- function(pattern, replacement, x) { 
> >     len <- length(x) 
> >     if (length(pattern) == 1) 
> >         pattern <- rep(pattern, len) 
> >     if (length(replacement) == 1) 
> >         replacement <- rep(replacement, len) 
> >     FUN <- function(i, ...) { 
> >         sub(pattern[i], replacement[i], x[i], fixed = TRUE) 
> >     } 
> >     idx <- 1:length(x) 
> >     sapply(idx, FUN) 
> > } 
> > sub2(patt, repl, X) 
> >  gives 
> > [1] "aB" "CD" "ef" 
> > 
> > Your method (I gave it the unique name "sub3") 
> >  sub3 <- function(pattern, replacement, x) {   len    <- length(x)  y   
>    <- character(length=len)  patlen <- length(pattern)  replen <- 
> length(replacement)  if(patlen != replen) stop('Error: Pattern and 
> replacement length do not match')  for(i in 1:replen) {   
>  y[which(x==pattern[i])] <- replacement[i]  }  return(y)}sub3(patt, repl, 
> X) 
> > gives[1] ""   "CD" "" 
> > 
> > Granted, whatever it does, it does it faster 
> > #Old method 1 
> > system.time(for(i in 1:50000) 
> > mapply(function(p,r,x) sub(p,r,x, fixed = TRUE),p=patt,r=repl,x=X)) 
> >    user  system elapsed 
> >    2.53    0.00    2.52 
> > 
> > #Old method 2 
> > system.time(for(i in 1:50000)sub2(patt, repl, X))   user  system elapsed 
> >    2.32    0.00    2.32 
> > 
> > #Your proposed method 
> > system.time(for(i in 1:50000) sub3(patt, repl, X)) 
> >    user  system elapsed 
> >    1.02    0.00    1.01 
> >  but would it still be faster if it actually solved the same problem? 
> > 
> > -John Thaden 
> > 
> > 
> > 
> > 
> >      On Monday, July 27, 2015 11:40 PM, Adam Erickson <
> adam.micha... at gmail.com <javascript:>> wrote: 
> > 
> > I know this is an old thread, but I wrote a simple FOR loop with 
> vectorized pattern replacement that is much faster than either of those (it 
> can also accept outputs differing in length from the patterns): 
> >   sub2  <- function(pattern, replacement, x) {     len   <- length(x)   
>  y      <- character(length=len)    patlen <- length(pattern)    replen <- 
> length(replacement)    if(patlen != replen) stop('Error: Pattern and 
> replacement length do not match')    for(i in 1:replen) {     
>  y[which(x==pattern[i])] <- replacement[i]    }    return(y)  } 
> > system.time(test <- sub2(patt, repl, XX))   user  system elapsed       0 
>       0       0 
> > Cheers, 
> > Adam 
> > On Wednesday, October 8, 2008 at 9:38:01 PM UTC-7, john wrote: 
> > Hello Christos, 
> >   To my surprise, vectorization actually hurt processing speed!#Example 
> > X <- c("ab", "cd", "ef") 
> > patt <- c("b", "cd", "a") 
> > repl <- c("B", "CD", "A")sub2 <- function(pattern, replacement, x) { 
> >     len <- length(x) 
> >     if (length(pattern) == 1) 
> >         pattern <- rep(pattern, len) 
> >     if (length(replacement) == 1) 
> >         replacement <- rep(replacement, len) 
> >     FUN <- function(i, ...) { 
> >         sub(pattern[i], replacement[i], x[i], fixed = TRUE) 
> >     } 
> >     idx <- 1:length(x) 
> >     sapply(idx, FUN) 
> > } 
> > 
> > system.time(  for(i in 1:10000)  sub2(patt, repl, X)  ) 
> >    user  system elapsed 
> >    1.18    0.07    1.26 system.time(  for(i in 1:10000) 
>  mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X) 
>  ) 
> >    user  system elapsed 
> >    1.42    0.05    1.47 
> > 
> > So much for avoiding loops. 
> > John Thaden======= At 2008-10-07, 14:58:10 Christos wrote: =======>John, 
> >>Try the following: 
> >> 
> >> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, 
> x=X) 
> >>   b   cd    a 
> >>"aB" "CD" "ef" 
> >> 
> >>-Christos>> -----My Original Message----- 
> >>> R pattern-matching and replacement functions are 
> >>> vectorized: they can operate on vectors of targets. 
> >>> However, they can only use one pattern and replacement. 
> >>> Here is code to apply a different pattern and replacement for 
> >>> every target.  My question: can it be done better? 
> >>> 
> >>> sub2 <- function(pattern, replacement, x) { 
> >>>     len <- length(x) 
> >>>     if (length(pattern) == 1) 
> >>>         pattern <- rep(pattern, len) 
> >>>     if (length(replacement) == 1) 
> >>>         replacement <- rep(replacement, len) 
> >>>     FUN <- function(i, ...) { 
> >>>         sub(pattern[i], replacement[i], x[i], fixed = TRUE) 
> >>>     } 
> >>>     idx <- 1:length(x) 
> >>>     sapply(idx, FUN) 
> >>> } 
> >>> 
> >>> #Example 
> >>> X <- c("ab", "cd", "ef") 
> >>> patt <- c("b", "cd", "a") 
> >>> repl <- c("B", "CD", "A") 
> >>> sub2(patt, repl, X) 
> >>> 
> >>> -John______________________________________________ 
> > R-h... at r-project.org mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-help 
> > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> > and provide commented, minimal, self-contained, reproducible code. 
> > 
> > 
> > 
> > 
> >         [[alternative HTML version deleted]] 
> > 
> > ______________________________________________ 
> > R-h... at r-project.org <javascript:> mailing list -- To UNSUBSCRIBE and 
> more, see 
> > https://stat.ethz.ch/mailman/listinfo/r-help 
> > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> > and provide commented, minimal, self-contained, reproducible code. 
>
> ______________________________________________ 
> R-h... at r-project.org <javascript:> mailing list -- To UNSUBSCRIBE and 
> more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help 
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> and provide commented, minimal, self-contained, reproducible code. 
>


More information about the R-help mailing list