[R] vectorized sub, gsub, grep, etc.

John Thaden jjthaden at flash.net
Wed Jul 29 00:00:46 CEST 2015


Adam,    The method you propose gives a different result than the prior methods for these example vectors
X <- c("ab", "cd", "ef")
patt <- c("b", "cd", "a")
repl <- c("B", "CD", "A")

Old method 1

mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
gives
  b   cd    a 
"aB" "CD" "ef"

Old method 2 

sub2 <- function(pattern, replacement, x) {
    len <- length(x)
    if (length(pattern) == 1) 
        pattern <- rep(pattern, len)
    if (length(replacement) == 1) 
        replacement <- rep(replacement, len)
    FUN <- function(i, ...) {
        sub(pattern[i], replacement[i], x[i], fixed = TRUE)
    }
    idx <- 1:length(x)
    sapply(idx, FUN)    
}
sub2(patt, repl, X)
 gives
[1] "aB" "CD" "ef"

Your method (I gave it the unique name "sub3")
 sub3 <- function(pattern, replacement, x) {   len    <- length(x)  y      <- character(length=len)  patlen <- length(pattern)  replen <- length(replacement)  if(patlen != replen) stop('Error: Pattern and replacement length do not match')  for(i in 1:replen) {    y[which(x==pattern[i])] <- replacement[i]  }  return(y)}sub3(patt, repl, X)
gives[1] ""   "CD" ""

Granted, whatever it does, it does it faster
#Old method 1
system.time(for(i in 1:50000)
mapply(function(p,r,x) sub(p,r,x, fixed = TRUE),p=patt,r=repl,x=X))
   user  system elapsed 
   2.53    0.00    2.52 
 
#Old method 2
system.time(for(i in 1:50000)sub2(patt, repl, X))   user  system elapsed 
   2.32    0.00    2.32 
 
#Your proposed method
system.time(for(i in 1:50000) sub3(patt, repl, X))
   user  system elapsed 
   1.02    0.00    1.01
 but would it still be faster if it actually solved the same problem?

-John Thaden
 



     On Monday, July 27, 2015 11:40 PM, Adam Erickson <adam.michael.erickson at gmail.com> wrote:
   
I know this is an old thread, but I wrote a simple FOR loop with vectorized pattern replacement that is much faster than either of those (it can also accept outputs differing in length from the patterns):
  sub2  <- function(pattern, replacement, x) {     len   <- length(x)    y      <- character(length=len)    patlen <- length(pattern)    replen <- length(replacement)    if(patlen != replen) stop('Error: Pattern and replacement length do not match')    for(i in 1:replen) {      y[which(x==pattern[i])] <- replacement[i]    }    return(y)  }
system.time(test <- sub2(patt, repl, XX))   user  system elapsed       0       0       0 
Cheers,
Adam
On Wednesday, October 8, 2008 at 9:38:01 PM UTC-7, john wrote:
Hello Christos,
  To my surprise, vectorization actually hurt processing speed!#Example
X <- c("ab", "cd", "ef")
patt <- c("b", "cd", "a")
repl <- c("B", "CD", "A")sub2 <- function(pattern, replacement, x) {
    len <- length(x)
    if (length(pattern) == 1) 
        pattern <- rep(pattern, len)
    if (length(replacement) == 1) 
        replacement <- rep(replacement, len)
    FUN <- function(i, ...) {
        sub(pattern[i], replacement[i], x[i], fixed = TRUE)
    }
    idx <- 1:length(x)
    sapply(idx, FUN)    
}
 
system.time(  for(i in 1:10000)  sub2(patt, repl, X)  )
   user  system elapsed 
   1.18    0.07    1.26 system.time(  for(i in 1:10000)  mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)  )
   user  system elapsed 
   1.42    0.05    1.47 
 
So much for avoiding loops.
John Thaden======= At 2008-10-07, 14:58:10 Christos wrote: =======>John,
>Try the following:
>
> mapply(function(p, r, x) sub(p, r, x, fixed = TRUE), p=patt, r=repl, x=X)
>   b   cd    a 
>"aB" "CD" "ef"  
>
>-Christos>> -----My Original Message-----
>> R pattern-matching and replacement functions are
>> vectorized: they can operate on vectors of targets.
>> However, they can only use one pattern and replacement.
>> Here is code to apply a different pattern and replacement for 
>> every target.  My question: can it be done better?
>> 
>> sub2 <- function(pattern, replacement, x) {
>>     len <- length(x)
>>     if (length(pattern) == 1) 
>>         pattern <- rep(pattern, len)
>>     if (length(replacement) == 1) 
>>         replacement <- rep(replacement, len)
>>     FUN <- function(i, ...) {
>>         sub(pattern[i], replacement[i], x[i], fixed = TRUE)
>>     }
>>     idx <- 1:length(x)
>>     sapply(idx, FUN)    
>> }
>> 
>> #Example
>> X <- c("ab", "cd", "ef")
>> patt <- c("b", "cd", "a")
>> repl <- c("B", "CD", "A")
>> sub2(patt, repl, X)
>> 
>> -John______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



  
	[[alternative HTML version deleted]]



More information about the R-help mailing list