[R] gsub syntax

Sundar Dorai-Raj sundar.dorai-raj at pdf.com
Sun Nov 27 11:37:41 CET 2005



John Logsdon wrote:
> Hello
> 
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
> 
> Can anyone explain the following gsub phenomenon to me:
> 
> 
>>dates<-c("73","74","02","1973","1974","2002")
> 
> 
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
> 
> 
>>substr(dates,3,4)
> 
> [1] ""   ""   ""   "73" "74" "02"
> 
>>substr(dates,-2,4)
> 
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
> 
>>substr(dates,4,-2)
> 
> [1] "" "" "" "" "" ""
> 
> So I tried gsub:
> 
> 
>>gsub("[19|20]([0-9][0-9])","\\1",dates)
> 
> [1] "73"  "74"  "02"  "973" "974" "002"
> 
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.  If I try what should also
> work:
> 
> 
>>gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> 
> [1] "73"  "74"  "02"  "973" "974" "002"
> 
> On the other hand the following does work:
> 
> 
>>gsub("[19|20]([0-9])([0-9])","\\2",dates) 
> 
> [1] "73" "74" "02" "73" "74" "02"
> 
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
> 
> 
>>s<-c("1","12","123","1234","12345","123456")
>>gsub("[12]([4-6]*)","",s)
> 
> [1] ""     ""     "3"    "34"   "345"  "3456"
> 
> Probably more elegant examples could be constructed that could home in on
> the issue.
> 
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
> 
> Questions:
> 
> 1) Am I misunderstanding the gsub use?
> 
> 2) Was it a bug that has since been corrected?
> 
> 3) Is it still a bug in the latest version?
> 
> TIA
> 
> John
>

Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, 
it seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on both Linux and Windows with R-2.2.0.
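
For what it is worth, the outputs quoted above look consistent with how bracket expressions are defined rather than with a bug: "[19|20]" is a character class that matches a single character (1, 9, |, 2 or 0), not the strings "19" or "20". The following is only a sketch of that reading; the variable names are made up for illustration, and the anchored alternation is one possible rewrite, not necessarily the best one.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# [19|20] is a bracket expression: it matches ONE character out of
# 1, 9, |, 2 or 0, not the two-character strings "19" or "20".
class_result <- gsub("[19|20]([0-9][0-9])", "\\1", dates)
# In "1973" the match is "197" (one class character plus two digits),
# which is replaced by "97"; the trailing "3" is left alone, giving "973".

# Alternation needs parentheses. The century is now group 1 and the
# two-digit year is group 2; ^ and $ pin the match to the whole string.
alt_result <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)

# The pattern suggested above: the greedy .* swallows everything except
# the final two digits, which are captured as group 1.
greedy_result <- sub(".*([0-9][0-9])", "\\1", dates)

print(class_result)   # "73" "74" "02" "973" "974" "002"
print(alt_result)     # "73" "74" "02" "73" "74" "02"
print(greedy_result)  # "73" "74" "02" "73" "74" "02"
```

On this reading, the "\\2" variant in the original message only appears to work: "197" is replaced by "7" and the untouched trailing "3" happens to complete "73". Likewise "[12]([4-6]*)" removes each 1 or 2 together with any run of 4 to 6 that follows it, which gives the output shown for s.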

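If a regular expression is not required, the substr approach from the original message can probably be rescued by computing the positions from the string length with nchar, since the quoted output shows that a negative start does not count from the end. A minimal sketch, assuming every element has at least two characters (n and last_two are illustrative names):

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Length of each string; start and stop are taken element by element.
n <- nchar(dates)

# Take the last two characters of every element.
last_two <- substr(dates, n - 1, n)

print(last_two)  # "73" "74" "02" "73" "74" "02"
```
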
HTH,

--sundar



