[R] glob2rx() {was: no bug in R2.1.0's list.files()}
Martin Maechler
maechler at stat.math.ethz.ch
Thu May 12 20:19:13 CEST 2005
>>>>> "BaRow" == Barry Rowlingson <B.Rowlingson at lancaster.ac.uk>
>>>>> on Thu, 12 May 2005 11:05:43 +0100 writes:
BaRow> Uwe Ligges wrote:
>> Please read about regular expressions (!!!) and try to
>> understand that ".txt" also finds "Not_a_txt_file.xls"
>> ....
BaRow> The confusion here is between regular expressions
BaRow> and wildcard expansion known as 'globbing'. The two
BaRow> things are very different, and use characters such as
BaRow> '*' '.' and '?' in different ways.
Exactly, I had devised a "glob" to "regexp" function many
years ago in order to help newbies make the transition.
That function, nowadays called 'glob2rx', has been part of our
(CRAN) package "sfsmisc" and is hence available to all via
    install.packages("sfsmisc")
    library("sfsmisc")
But it's quite simple (though not trivial to read for the
inexperienced because of the many escapes ("\") needed)
and it may be helpful to see its code on R-help, below.
Then, this topic has led me to add 2 (obvious in hindsight)
optional logical arguments to the function so that it now looks like
glob2rx <- function(pattern, trim.head = FALSE, trim.tail = TRUE)
{
    ## Purpose: Change "ls" aka "wildcard" aka "globbing" _pattern_ to
    ##	      Regular Expression (as in grep, perl, emacs, ...)
    ## -------------------------------------------------------------------------
    ## Author: Martin Maechler ETH Zurich, ~ 1991
    ##	       New version using [g]sub() : 2004
    p <- gsub('\\.','\\\\.', paste('^', pattern, '$', sep=''))
    p <- gsub('\\?', '.', gsub('\\*', '.*', p))
    ## these are trimming '.*$' and '^.*' - in most cases only for esthetics
    if(trim.tail) p <- sub("\\.\\*\\$$", '', p)
    if(trim.head) p <- sub("\\^\\.\\*",  '', p)
    p
}
So those confused newbies (and DOS long timers!)
could use
    list.files(myloc, glob2rx("*.zip"), full=TRUE)
    ## (yes, make a habit of using 'TRUE', not 'T' ..)
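As a rough illustration of the difference this makes (a sketch only: the
directory and file names are made up, and it relies on the glob2rx() that
current R versions ship in the 'utils' package, which takes the same
arguments as the function above):

```r
## Hypothetical scratch directory with made-up file names
d <- file.path(tempdir(), "glob2rx-demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("a.zip", "b.zip",
                                     "zipper.txt", "notes.zip.bak"))))

## A bare "zip" is a regular expression: it matches "zip" anywhere in the name
loose <- list.files(d, pattern = "zip")

## The translated glob is anchored, so only names ending in ".zip" match
strict <- list.files(d, pattern = utils::glob2rx("*.zip"))

print(loose)
print(strict)
```

Only the second call behaves the way someone used to "*.zip" on a command
line would expect.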
The current example code, BTW, has
    stopifnot(glob2rx("abc.*") == "^abc\\.",
              glob2rx("a?b.*") == "^a.b\\.",
              glob2rx("a?b.*", trim.tail=FALSE) == "^a.b\\..*$",
              glob2rx("*.doc") == "^.*\\.doc$",
              glob2rx("*.doc", trim.head=TRUE) == "\\.doc$",
              glob2rx("*.t*")  == "^.*\\.t",
              glob2rx("*.t??") == "^.*\\.t..$"
              )
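To see why the trimming is mostly cosmetic, here is a small check (with
made-up file names) that the trimmed and untrimmed patterns from the
examples above select the same strings when used with grepl():

```r
## Made-up names, just for illustration
x <- c("abc.txt", "abc", "xabc.txt", "abc.tar.gz", "report.doc", "doc.report")

## glob "abc.*" with and without trim.tail
tail_trimmed   <- grepl("^abc\\.", x)
tail_untrimmed <- grepl("^abc\\..*$", x)

## glob "*.doc" with and without trim.head
head_trimmed   <- grepl("\\.doc$", x)
head_untrimmed <- grepl("^.*\\.doc$", x)

print(identical(tail_trimmed, tail_untrimmed))
print(identical(head_trimmed, head_untrimmed))
```

For plain matching like this, a leading '^.*' or a trailing '.*$' adds
nothing, which is why dropping them only makes the pattern shorter to read.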
Martin Maechler,
ETH Zurich
BaRow> There's added confusion when people come from a DOS
BaRow> background, where commands did their own thing when
BaRow> given '*' as parameter. The DOS command:
BaRow> RENAME *.FOO *.BAR
BaRow> did what seems obvious, renaming all the .FOO files
BaRow> to .BAR, but on a unix machine doing this with 'mv'
BaRow> can be destructive!
BaRow> In short (and slightly simplified), a '*' when
BaRow> expanded as a wildcard in a glob matches any string,
BaRow> whereas a '*' in a regular expression (regexp),
BaRow> matches the previous character 0 or more times. This
BaRow> is why "*.zip" is flagged as invalid now - there's no
BaRow> character before the "*".
BaRow> That should be enough clues to send you on your
BaRow> way.
BaRow> Baz
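Barry's point about '*' can be checked directly in R. A small sketch with
made-up strings, where the second pattern is what the glob "ab*c" turns
into after translation:

```r
x <- c("ac", "abc", "abbbc", "abxc")

## As a regexp, "ab*c" means: 'a', then zero or more 'b', then 'c'
as_regexp <- grepl("ab*c", x)

## As a glob, "ab*c" means: "ab", then any string, then "c"
as_glob <- grepl("^ab.*c$", x)

print(data.frame(x, as_regexp, as_glob))
```

The two readings disagree on "ac" and on "abxc", which is exactly the kind
of surprise the original question ran into.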