[R] Regex engine types

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Jun 12 09:52:58 CEST 2006

On Sat, 10 Jun 2006, Prof Brian Ripley wrote:

> ?regex does describe this:
>     A range of characters may be specified by giving the first and last
>     characters, separated by a hyphen.  (Character ranges are
>     interpreted in the collation order of the current locale.)
> You did not tell us your locale, but based on questions from you in the past 
> I would guess en_NZ.utf8.  In that locale the collation order is wWxXyYzZ, so 
> your surprise is explained.  (It seems the PCRE code is not using the same 
> ordering in that locale.)

Some digging shows that Perl does not say explicitly what order it uses 
(at least in the man pages on my system), but that PCRE uses (see man 

- numerical order of the bytes in a single-byte locale
- numerical order of Unicode points in a UTF-8 locale.

whereas the basic/extended code uses the order set by the locale category 
LC_COLLATE and interpreted by the C function wcscoll (and byte order 
if that is not available).
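
To make the difference concrete, here is a small language-neutral sketch (in Python, not R) of how a range such as [W-Z] picks up different members under the two orderings. It does not call any regex engine: the helper chars_in_range and the two alphabet constants are hypothetical names introduced for illustration, and the locale alphabet is simply the wWxXyYzZ order quoted above, not one queried from a real locale.

```python
# Illustrative model only: a bracket range [lo-hi] is treated as
# "every character positioned between lo and hi in some ordering".

# Code-point (byte) order: uppercase letters sort before lowercase.
CODEPOINT_ALPHABET = "".join(sorted("wWxXyYzZ"))  # "WXYZwxyz"

# Assumed collation order, taken from the en_NZ.utf8 remark quoted above.
LOCALE_ALPHABET = "wWxXyYzZ"


def chars_in_range(lo, hi, alphabet):
    """Return the characters of `alphabet` from lo to hi inclusive,
    where position in `alphabet` defines the ordering."""
    start, end = alphabet.index(lo), alphabet.index(hi)
    return alphabet[start:end + 1]


codepoint_members = chars_in_range("W", "Z", CODEPOINT_ALPHABET)
locale_members = chars_in_range("W", "Z", LOCALE_ALPHABET)

print(codepoint_members)  # WXYZ
print(locale_members)     # WxXyYzZ
```

Under the code-point ordering the range holds only the four uppercase letters, whereas under the assumed collation ordering it also takes in x, y and z. That is the kind of surprise described above when the same pattern is run with and without perl = TRUE.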

Gabor Grothendieck <ggrothendieck at gmail.com> wrote:

> I get the same thing on "Version 2.3.1 Patched (2006-06-04 r38279)"
> but on "R version 2.2.1, 2005-12-20" it gives character(0), as
> expected, so there is some change between versions of R.  I am
> on Windows XP.

And a helpful person would have studied the CHANGES file before 
commenting!  It says:


   There is no longer a separate 'East Asian' version of R.dll.

In R 2.2.1 the fully internationalized version behaved as 2.3.1 did, but 
the 8-bit-only version for Windows always used byte-order collation. The 
difference is most likely that GG was using the 8-bit-only version, a 
Windows-specific issue.

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
