[R] split strings
Wacek Kusnierczyk
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Thu May 28 14:30:14 CEST 2009
(diverted to r-devel, a source code patch attached)
Wacek Kusnierczyk wrote:
> Allan Engelhardt wrote:
>
>> Immaterial, yes, but it is always good to test :) and your solution
>> *is* faster and it is even faster if you can assume byte strings:
>>
>
> :)
>
> indeed; though if the speed is immaterial (and in this case it
> supposedly was), it's probably not worth risking fixed=TRUE removing
> '.tif' from the middle of the name, however unlikely this might be (cf
> murphy's laws).
>
> but if you can assume that each string ends with a '.tif' (or any other
> \..{3} substring), then substr is marginally faster than sub, even as a
> three-pass approach, while avoiding the risk of removing '.tif' from the
> middle:
>
> strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
>     paste(sample(letters, 10), collapse='')))
> library(rbenchmark)
> benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>     substr={basenames=basename(strings)
>         substr(basenames, 1, nchar(basenames)-4)},
>     sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
> #     test elapsed
> # 1 substr   3.176
> # 2    sub   3.296
>
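to make the risk mentioned above concrete, here is a minimal sketch in plain R. the file name is made up for illustration, and the anchored regex is only one possible safeguard, not something benchmarked in this thread:

```r
# made-up name with '.tif' in the middle as well as at the end
name = 'scan.tiff.tif'

# fixed=TRUE removes the first literal match, which is the one in the middle
by_fixed = sub('.tif', '', name, fixed=TRUE)
# "scanf.tif"

# dropping the last four characters only touches the end of the string
by_substr = substr(name, 1, nchar(name)-4)
# "scan.tiff"

# a regex anchored with $ also only touches the end
by_anchor = sub('\\.tif$', '', name)
# "scan.tiff"
```
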
btw., i wonder why negative indices are silently treated as 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # same as substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # same as substr('foobar', 2, 1)

this does not seem to be documented in ?substr. there are ways to make
negative indices meaningful, e.g., by counting them from the end of the
string (as in, e.g., perl):

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # same as substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # same as substr('foobar', 2, 6-2+1)

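for anyone who wants to try these semantics without rebuilding R, a rough pure-R sketch follows. substr_neg is a hypothetical name; it is neither the attached C patch nor part of base R. it assumes that negative indices count from the end of each string and leaves non-negative ones untouched:

```r
# hypothetical helper, not base R and not the attached patch
substr_neg = function(x, start, stop) {
    n = nchar(x)
    # shift negative indices by n+1, so that -1 refers to the last character;
    # plain arithmetic keeps this vectorised over x
    start = start + (start < 0) * (n + 1)
    stop = stop + (stop < 0) * (n + 1)
    substr(x, start, stop)
}

substr_neg('foobar', -5, 5)
# "ooba"
substr_neg('foobar', 2, -2)
# "ooba"
substr_neg(c('abc.tif', 'abcdef.tif'), 1, -5)
# "abc" "abcdef"
```

being an R-level wrapper, this adds extra nchar and arithmetic passes, so it says nothing about the speed of the C-level change.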
there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch. the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff
    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #           test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273

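the timings only mean something if the candidates return the same values. a small sanity check that runs on an unpatched R is sketched below; it covers only the substr-nchar and sub variants, since substr with a negative stop needs the patch. the seed and the variable names via_substr and via_sub are arbitrary choices for this sketch:

```r
set.seed(1)  # arbitrary, only to make the sketch reproducible
strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
    paste(sample(letters, 10), collapse='')))
basenames = basename(strings)

# the two approaches that work without the patch
via_substr = substr(basenames, 1, nchar(basenames)-4)
via_sub = sub('.tif', '', basenames, fixed=TRUE, useBytes=TRUE)

# the generated names contain letters only, so '.tif' occurs just once,
# at the end, and both approaches should agree
stopifnot(identical(via_substr, via_sub))
```
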
if this sounds interesting, i can update the docs accordingly.
vQ