[R] [Rd] split strings

Thu May 28 15:23:17 CEST 2009


Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
> Sent: Thursday, May 28, 2009 5:30 AM
> Cc: R help project; r-devel at r-project.org; Allan Engelhardt
> Subject: Re: [Rd] [R] split strings
> 
> (diverted to r-devel, a source code patch attached)
> 
> Wacek Kusnierczyk wrote:
> > Allan Engelhardt wrote:
> >   
> >> Immaterial, yes, but it is always good to test :) and your solution
> >> *is* faster and it is even faster if you can assume byte strings:
> >>     
> >
> > :)
> >
> > indeed;  though if the speed is immaterial (and in this case it
> > supposedly was), it's probably not worth risking fixed=TRUE removing
> > '.tif' from the middle of the name, however unlikely this might be
> > (cf murphy's laws).
> >
> > but if you can assume that each string ends with a '.tif' (or any
> > other \..{3} substring), then substr is marginally faster than sub,
> > even as a three-pass approach, while avoiding the risk of removing
> > '.tif' from the middle:
> >
> >     strings = sprintf('f:/foo/bar//%s.tif',
> >        replicate(1000, paste(sample(letters, 10), collapse='')))
> >     library(rbenchmark)
> >     benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
> >        substr={basenames=basename(strings);
> >           substr(basenames, 1, nchar(basenames)-4)},
> >        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
> >     #     test elapsed
> >     # 1 substr   3.176
> >     # 2    sub   3.296
> >   
> 
> btw., i wonder why negative indices default to 1 in substr:
> 
>     substr('foobar', -5, 5)
>     # "fooba"
>     # substr('foobar', 1, 5)
>     substr('foobar', 2, -2)
>     # ""
>     # substr('foobar', 2, 1)
> 
> this does not seem to be documented in ?substr.

Would your patched code affect the following
use of regexpr's output as input to substr, to
pull out the matched text from the string?
   > x<-c("ooo","good food","bad")
   > r<-regexpr("o+", x)
   > substring(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > substr(x,r,attr(r,"match.length")+r-1)
   [1] "ooo" "oo"  ""   
   > r
   [1]  1  2 -1
   attr(,"match.length")
   [1]  3  2 -1
   > attr(r,"match.length")+r-1
   [1]  3  3 -3
   attr(,"match.length")
   [1]  3  2 -1
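
For reference, a minimal sketch of how that idiom would behave if negative
indices counted from the end. It assumes the rule implied by the hypothetical
examples quoted below (a negative index -k maps to nchar(x) - k + 1); the
attached patch itself is not reproduced here, and substr_neg is a made-up
helper name, not part of R.

```r
# Hypothetical pure-R emulation of "negative indices count from the end".
# Assumption: -k maps to nchar(x) - k + 1; non-negative indices are unchanged.
substr_neg <- function(x, start, stop) {
  n <- nchar(x)
  # Recycle to the length of x and drop attributes such as "match.length".
  start <- rep_len(as.vector(start), length(x))
  stop  <- rep_len(as.vector(stop),  length(x))
  start <- ifelse(start < 0, n + start + 1, start)
  stop  <- ifelse(stop  < 0, n + stop  + 1, stop)
  substr(x, start, stop)
}

x   <- c("ooo", "good food", "bad")
r   <- regexpr("o+", x)
len <- attr(r, "match.length")

current  <- substr(x, r, r + len - 1)      # behaviour of unpatched substr
emulated <- substr_neg(x, r, r + len - 1)  # behaviour under the assumed rule
print(current)
print(emulated)
```

Under this assumed rule the no-match element still comes back empty: start -1
maps to the last character and stop -3 to a position two before it, so stop
precedes start. Whether the actual patch behaves the same way would need to be
checked against the patched build.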

> there are ways to make negative indices meaningful, e.g., by taking
> them as indexing from behind (as in, e.g., perl):
> 
>     # hypothetical
>     substr('foobar', -5, 5)
>     # "ooba"
>     # substr('foobar', 6-5+1, 5)
>     substr('foobar', 2, -2)
>     # "ooba"
>     # substr('foobar', 2, 6-2+1)
> 
> there is a trivial fix to src/main/character.c that gives substr the
> extended functionality -- see the attached patch.  the patch has been
> created and tested as follows:
> 
>     svn co https://svn.r-project.org/R/trunk r-devel
>     cd r-devel
>     # modifications made to src/main/character.c
>     svn diff > character.c.diff
>     svn revert -R .
>     patch -p0 < character.c.diff
>    
>     ./configure
>     make
>     make check-all
>     # no problems reported
> 
> with the patched substr, the original problem can now be solved more
> concisely, using a two-pass approach, with performance still better
> than the sub/fixed/bytes one, as follows:
> 
>     strings = sprintf('f:/foo/bar//%s.tif',
>         replicate(1000, paste(sample(letters, 10), collapse='')))
>     library(rbenchmark)
>     benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>         substr=substr(basename(strings), 1, -5),
>         'substr-nchar'={
>             basenames=basename(strings)
>             substr(basenames, 1, nchar(basenames)-4) },
>         sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
>     #           test elapsed
>     # 1       substr   2.981
>     # 2 substr-nchar   3.206
>     # 3          sub   3.273
> 
> if this sounds interesting, i can update the docs accordingly.
> 
> vQ
>
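
As a side note on the concern quoted above about fixed=TRUE removing '.tif'
from the middle of a name, here is a small sketch in unpatched R. The file
names are invented for illustration, and the end-anchored pattern '\\.tif$'
is an alternative that was not benchmarked in this thread.

```r
# Invented example paths; the second has '.tif' in the middle of the name.
paths <- c("f:/foo/bar//abcdefghij.tif", "f:/foo/bar//scan.tif.copy.tif")
b <- basename(paths)

# fixed=TRUE removes the first '.tif' it finds, wherever it is.
by_fixed <- sub(".tif", "", b, fixed = TRUE)

# Dropping the last four characters assumes every name ends in a
# four-character suffix such as '.tif'.
by_substr <- substr(b, 1, nchar(b) - 4)

# An end-anchored regular expression only touches a trailing '.tif'.
by_anchor <- sub("\\.tif$", "", b)

print(data.frame(b, by_fixed, by_substr, by_anchor))
```

For the second name, the fixed=TRUE call strips the inner '.tif' and leaves the
trailing one in place, while the other two approaches remove only the suffix.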