[R] regexp mystery

Tue Oct 16 11:08:00 CEST 2018

On Tue, 16 Oct 2018 08:36:27 +0000
PIKAL Petr <petr.pikal using precheza.cz> wrote:

> > dput(x[11])  
> "et odYezko: 3                     \fas odYezku:   15 s"

> gsub("^.*: (\\d+).*$", "\\1", x[11])
> works for 3

This regular expression allows exactly one space between the colon and
the number, but there are several spaces before "15", so only ": 3"
can match.
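
A minimal sketch of that point, assuming x[11] holds the string quoted
above (the names x11, one_space and many_spaces are only illustrative):
allowing one or more spaces after the colon lets the same pattern reach
"15", because the greedy ^.* then backs off only as far as the last
colon that is followed by spaces and digits.

```r
# The string from dput(x[11]) above ("\f" is a form feed character).
x11 <- "et odYezko: 3                     \fas odYezku:   15 s"

# Exactly one space after the colon: only ": 3" fits, so the capture is "3".
one_space <- gsub("^.*: (\\d+).*$", "\\1", x11)

# One or more spaces after the colon: the greedy ^.* now stops at the
# last colon followed by spaces and digits, so the capture is "15".
many_spaces <- gsub("^.*: +(\\d+).*$", "\\1", x11)

print(one_space)
print(many_spaces)
```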

> gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
> works for 15

The match succeeds because a space is not a colon:

 ^.* matches "et odYezko: 3                     \fas odYezku: "
 [^:] matches the second space after the colon
 the literal space " " matches the third space
 finally, (\\d+) matches "15"
 and .*$ matches " s"
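
The breakdown above can be checked by wrapping each piece of the
pattern in its own capture group and asking regexec for the pieces.
This is only an illustrative sketch: the extra parentheses and the
names x11, pieces and parts are not part of the original pattern.

```r
x11 <- "et odYezko: 3                     \fas odYezku:   15 s"

# Same pattern as above, with every piece wrapped in its own group.
pieces <- "^(.*)([^:])( )(\\d+)(.*)$"
parts <- regmatches(x11, regexec(pieces, x11))[[1]]

# parts[1] is the whole match; parts[2] to parts[6] are the five pieces.
# Printed with explicit quoting so that single spaces stay visible.
print(encodeString(parts[-1], quote = '"'))
```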

If you need just the numbers, you might have more success by extracting
matches directly with gregexpr and regmatches:

(
	function(s) regmatches(
		s,
		gregexpr("\\d+(\\.\\d+)?", s)
	)
)("et odYezko: 3                     \fas odYezku:   15 s")

[[1]]
[1] "3"  "15"

(I'm creating an anonymous function and evaluating it immediately
because I need to pass the same string to both gregexpr and regmatches.)
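
An equivalent and perhaps more readable sketch stores the string in a
variable first, which also makes it easy to convert the matches to
numbers. The variable names s and nums are only illustrative.

```r
s <- "et odYezko: 3                     \fas odYezku:   15 s"

# gregexpr finds every match; regmatches pulls the matched text out of s.
nums <- regmatches(s, gregexpr("\\d+(\\.\\d+)?", s))[[1]]

# The matches are character strings; convert them if numbers are needed.
print(as.numeric(nums))
```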

If you need to capture numbers appearing in a specific context, a
better regular expression suiting your needs might be

":\\s*(\\d+(?:\\.\\d+)?)"

(A colon, followed by optional whitespace, followed by the number to
capture: one or more digits, optionally followed by a dot and more
digits in a non-capturing group)

but I couldn't find a way to extract the captures from repeated matches
using vanilla R pattern matching (it's either regexec, which returns
captures for the first match only, or gregexpr, which returns all
matches but without the captures). If you can load the stringr package,
it's very easy, though:

library(stringr)

str_match_all(
	c(
		"PYedehYev:  300 s              Záva~í: 2.160 kg",
		"et odYezko: 3               \fas odYezku:   15 s"
	),
	":\\s*(\\d+(?:\\.\\d+)?)"
)

[[1]]
     [,1]      [,2]   
[1,] ":  300"  "300"  
[2,] ": 2.160" "2.160"

[[2]]
     [,1]     [,2]
[1,] ": 3"    "3" 
[2,] ":   15" "15"

Column 2 of each list item contains the requested captures.
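
If stringr is not available, one possible workaround in base R is a
two-pass approach: find every match with gregexpr, then run regexec on
each matched substring to pull out its capture. This is a sketch rather
than a general solution. The helper name capture_all is hypothetical,
the first input string is made up for illustration, and the approach
assumes that the pattern matches the same way when it is re-applied to
the matched substring alone (so no anchors or lookaround).

```r
capture_all <- function(s, pattern) {
  # Pass 1: every full match in every element of s.
  full <- regmatches(s, gregexpr(pattern, s, perl = TRUE))
  lapply(full, function(m) {
    # Pass 2: re-match each substring to recover its first capture group.
    caps <- regmatches(m, regexec(pattern, m, perl = TRUE))
    vapply(caps, function(x) x[2], character(1))
  })
}

res <- capture_all(
  c(
    "time:  300 s              weight: 2.160 kg",  # made-up ASCII example
    "et odYezko: 3               \fas odYezku:   15 s"
  ),
  ":\\s*(\\d+(?:\\.\\d+)?)"
)
print(res)
```

Each list item then holds only the captures, much like column 2 of the
stringr result. Newer versions of R (4.1.0 and later, as far as I know)
also provide gregexec(), which is meant to return all matches together
with their captures in one call.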

-- 
Best regards,
Ivan