[R] Locating the starting position of the first number in a string

Tue Nov 3 01:18:02 CET 2015

The regular expression you are looking for is
  \d{5}
... a "digit" repeated five times.
Note that you have to escape the backslash in an R string, i.e. write "\\d{5}".
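
To make the escaping concrete, here is a minimal sketch using regexpr() on two IDs taken from your vector (the variable names are mine, not from your code). It also shows why the "[0:9]" attempt went wrong: that is a character class holding the three characters 0, : and 9, not a range of digits.

```r
ids <- c("IBBS3_MSM_HN01209", "IBBS3_MSM_HN44215")

# The regex \d{5} is written "\\d{5}" inside an R string literal.
pos <- regexpr("\\d{5}", ids)
as.integer(pos)                      # 13 13: start of the five-digit run
attr(pos, "match.length")            # 5 5

# "[0:9]" matches only the characters 0, : or 9 ...
as.integer(regexpr("[0:9]", ids))    # 13 -1
# ... whereas "[0-9]" is the digit range; note that it already
# matches the 3 in the "IBBS3" prefix.
as.integer(regexpr("[0-9]", ids))    # 5 5
```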

But your example does not conform to the description: your vector contains a six-digit number, IBBS3_MSM_HN104213.

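If there is length variation, I would just search for 
   \d+     (at least one) or 
   \d{5,}  (at least five)
Note that an unanchored \d+ first matches the "3" in the "IBBS3" prefix;
anchor it to the end of the string (\d+$) or use \d{5,}.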
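
A quick sketch comparing these variants on three IDs from your vector. The name `ids` is mine, and the `$` anchor is my own addition: it assumes the number always ends the string, which holds for the IDs you posted but may not hold in general.

```r
ids <- c("IBBS3_MSM_HN01209", "IBBS3_MSM_HN104213", "IBBS3_MSM_HMC44285")

as.integer(regexpr("\\d+", ids))      # 5 5 5: the "3" in "IBBS3"
as.integer(regexpr("\\d{5,}", ids))   # 13 13 14
as.integer(regexpr("\\d+$", ids))     # 13 13 14 (assumes digits end the string)

# Lengths of the \d{5,} matches, showing the six-digit case
attr(regexpr("\\d{5,}", ids), "match.length")   # 5 6 5
```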

And even though you sent a vector with a hundred elements, it doesn't actually contain all the cases you are asking about: there is no PWID in it.
Finally, I'm not sure why you want the "starting" positions rather than the keys themselves.

Your sample code is not at all how one does this. Define the three elements that you want to capture, put each of them in parentheses, and evaluate the matches that regexec() returns. Also, please give us a smaller example, but one that contains all of the relevant cases.

ID <- c(
"IBBS3_MSM_HN01209",
"IBBS3_PWID_HN01210",
"IBBS3_MSM_HMC01211",
"IBBS3_PWID_HMC10212")

# now consider the regular expression:
regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[1])

# This is:
#   any character one or more times,
#   followed by either MSM OR PWID,
#   followed by an underscore, 
#   followed by either HN OR HMC,
#   followed by one or more digits

# Look at the result: it's a list with one element per input string.
# Each element is an integer vector of starting positions; its
# "match.length" attribute holds the corresponding match lengths.

# Compare:
regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[3])

# Following the logic of the nested parentheses,
# the three matches you are looking for are elements 2, 5 and
# 8 of that vector (element 1 is the whole match).
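
To see where the indices 2, 5 and 8 come from, here is a small sketch that pulls the result apart for one ID. The names `pat`, `starts` and `lens` are mine; the pattern is the one used above.

```r
pat <- ".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)"
m <- regexec(pat, "IBBS3_PWID_HMC10212")[[1]]

starts <- as.integer(m)              # element 1 is the whole match
lens   <- attr(m, "match.length")    # lengths, in the same order

length(starts)                       # 8: whole match plus 7 groups

# Elements 2, 5 and 8 are the outer TYPE group, the outer GROUP
# group and the digits; the others are the inner alternatives.
starts[c(2, 5, 8)]                   # 7 12 15
lens[c(2, 5, 8)]                     # 4 3 5
```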

result <- matrix(numeric(3 * length(ID)), ncol=3)
colnames(result) <- c("TYPE", "GROUP", "ID")

for (i in seq_along(ID)) {           # seq_along() is safe for empty vectors
     m <- regexec(".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)", ID[i])
     result[i,] <- m[[1]][c(2, 5, 8)] # write the three starting
                                      # positions into a row
                                      # of your matrix 
}

# Of course it's trivial now to actually capture
# the keys, but that's not what you asked for...
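
For completeness, a sketch of how the keys themselves could be captured with regmatches(), together with a loop-free way to collect the starting positions. This is one possible way to do it, not the only one; the names `pat`, `starts` and `keys` are mine.

```r
ID <- c("IBBS3_MSM_HN01209", "IBBS3_PWID_HN01210",
        "IBBS3_MSM_HMC01211", "IBBS3_PWID_HMC10212")
pat <- ".+((MSM)|(PWID))_((HN)|(HMC))(\\d+)"

m <- regexec(pat, ID)                # regexec() is vectorized over ID

# Starting positions, one row per ID, without an explicit loop
starts <- t(sapply(m, function(x) as.integer(x)[c(2, 5, 8)]))
colnames(starts) <- c("TYPE", "GROUP", "ID")

# The matched substrings themselves
keys <- t(sapply(regmatches(ID, m), function(x) x[c(2, 5, 8)]))
colnames(keys) <- c("TYPE", "GROUP", "ID")

starts
keys
```

An ID that does not match the pattern at all yields a row of NA in both matrices, so it is worth checking for that before using the result.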

B.

On Nov 2, 2015, at 1:39 PM, Jennifer Sabatier <plessthanpointohfive at gmail.com> wrote:

> Hi,
> 
> 
> So, I've got a vector of strings that look like this:
> ID <- c("IBBS3_MSM_HN01209","IBBS3_MSM_HN01210","IBBS3_MSM_HN01211",
> "IBBS3_MSM_HN10212","IBBS3_MSM_HN104213","IBBS3_MSM_HN10214",
> "IBBS3_MSM_HN44215","IBBS3_MSM_HN44216","IBBS3_MSM_HN44217",
> "IBBS3_MSM_HN44218","IBBS3_MSM_HN44219","IBBS3_MSM_HN44220",
> "IBBS3_MSM_HN44221","IBBS3_MSM_HN44222","IBBS3_MSM_HN44223",
> "IBBS3_MSM_HN44224","IBBS3_MSM_HN44225","IBBS3_MSM_HN44226",
> "IBBS3_MSM_HN44227","IBBS3_MSM_HN12228","IBBS3_MSM_HN12229",
> "IBBS3_MSM_HN12230","IBBS3_MSM_HN12231","IBBS3_MSM_HN12232",
> "IBBS3_MSM_HN12233","IBBS3_MSM_HN12234","IBBS3_MSM_HN12235",
> "IBBS3_MSM_HN12236","IBBS3_MSM_HN12237","IBBS3_MSM_HN12238",
> "IBBS3_MSM_HN12239","IBBS3_MSM_HN12240","IBBS3_MSM_HN12241",
> "IBBS3_MSM_HN12242","IBBS3_MSM_HN12243","IBBS3_MSM_HN12244",
> "IBBS3_MSM_HN12245","IBBS3_MSM_HN12246","IBBS3_MSM_HN12247",
> "IBBS3_MSM_HN12248","IBBS3_MSM_HN12249","IBBS3_MSM_HN12250",
> "IBBS3_MSM_HN12251","IBBS3_MSM_HN12252","IBBS3_MSM_HN12253",
> "IBBS3_MSM_HN12254","IBBS3_MSM_HN12255","IBBS3_MSM_HN25256",
> "IBBS3_MSM_HN25257","IBBS3_MSM_HN25258","IBBS3_MSM_HN25259",
> "IBBS3_MSM_HN25260","IBBS3_MSM_HN25261","IBBS3_MSM_HN25262",
> "IBBS3_MSM_HN25263","IBBS3_MSM_HN25264","IBBS3_MSM_HN25265",
> "IBBS3_MSM_HN25266","IBBS3_MSM_HN25267","IBBS3_MSM_HN25268",
> "IBBS3_MSM_HN25269","IBBS3_MSM_HN25270","IBBS3_MSM_HN25271",
> "IBBS3_MSM_HN25272","IBBS3_MSM_HN25273","IBBS3_MSM_HN25274",
> "IBBS3_MSM_HN25275","IBBS3_MSM_HN25276", "IBBS3_MSM_HN25277",
> "IBBS3_MSM_HN25278","IBBS3_MSM_HN25279","IBBS3_MSM_HN25280",
> "IBBS3_MSM_HN25281","IBBS3_MSM_HN25282","IBBS3_MSM_HN25283",
> "IBBS3_MSM_HN25284","IBBS3_MSM_HMC44285",  "IBBS3_MSM_HMC44286",
> "IBBS3_MSM_HMC44287","IBBS3_MSM_HMC44288","IBBS3_MSM_HMC44289",
> "IBBS3_MSM_HMC44290","IBBS3_MSM_HMC44291","IBBS3_MSM_HMC44292",
> "IBBS3_MSM_HMC44293","IBBS3_MSM_HMC44294","IBBS3_MSM_HMC44295",
> "IBBS3_MSM_HMC44296","IBBS3_MSM_HMC44297","IBBS3_MSM_HMC44298",
> "IBBS3_MSM_HMC44299","IBBS3_MSM_HMC44300","IBBS3_MSM_HMC44301",
> "IBBS3_MSM_HMC44302","IBBS3_MSM_HMC44303","IBBS3_MSM_HMC44304",
> "IBBS3_MSM_HMC44305","IBBS3_MSM_HMC44306","IBBS3_MSM_HMC44307",
> "IBBS3_MSM_HMC44309")
> 
> 
> 
> 
> This is an ID that is in the following format:  IBBS3_Type_Group#####
> 
> 
> What I want to do is locate the starting position of Type, which is
> anywhere from 3 to 4 letters long (in this example it's either MSM or
> PWID), the starting position of Group which is 2-3 letters long (either HN
> or HMC), and finally the starting position of the 5-digit number.
> 
> 
> I'm able to get Type and Group using the following:
> 
> 
> TYPE_s <- sapply(c("MSM", "PWID"), regexpr, ID, ignore.case=T)
> 
> GROUP_s <- (sapply(c("HN", "HMC"), regexpr, ID, ignore.case=T))
> 
> 
> What I am having trouble with is getting the starting position of the
> 5-digit number.
> 
> 
> I am trying:
> 
> 
> DIGITS_s <- sapply("([0:9])", regexpr, ID, ignore.case=T)
> 
> 
> But that just seems to look for the position of the first 0:
> 
> 
>> DIGITS_s
>       ([0:9])
>  [1,]      13
>  [2,]      13
>  [3,]      13
>  [4,]      14
>  [5,]      14
>  [6,]      14
>  [7,]      -1
>  [8,]      -1
>  [9,]      -1
> [10,]      -1
> [11,]      17
> [12,]      17
> [13,]      -1
> [14,]      -1
> [15,]      -1
> [16,]      -1
> [17,]      -1
> [18,]      -1
> [19,]      -1
> [20,]      -1
> [21,]      17
> [22,]      17
> [23,]      -1
> [24,]      -1
> [25,]      -1
> [26,]      -1
> [27,]      -1
> [28,]      -1
> [29,]      -1
> [30,]      -1
> [31,]      17
> [32,]      17
> [33,]      -1
> [34,]      -1
> [35,]      -1
> [36,]      -1
> [37,]      -1
> [38,]      -1
> [39,]      -1
> [40,]      -1
> [41,]      17
> [42,]      17
> [43,]      -1
> [44,]      -1
> [45,]      -1
> [46,]      -1
> [47,]      -1
> [48,]      -1
> [49,]      -1
> [50,]      -1
> [51,]      17
> [52,]      17
> [53,]      -1
> [54,]      -1
> [55,]      -1
> [56,]      -1
> [57,]      -1
> [58,]      -1
> [59,]      -1
> [60,]      -1
> [61,]      17
> [62,]      17
> [63,]      -1
> [64,]      -1
> [65,]      -1
> [66,]      -1
> [67,]      -1
> [68,]      -1
> [69,]      -1
> [70,]      -1
> [71,]      17
> [72,]      17
> [73,]      -1
> [74,]      -1
> [75,]      -1
> [76,]      -1
> [77,]      -1
> [78,]      -1
> [79,]      -1
> [80,]      -1
> [81,]      18
> [82,]      17
> [83,]      17
> [84,]      17
> [85,]      17
> [86,]      17
> [87,]      17
> [88,]      17
> [89,]      17
> [90,]      17
> [91,]      17
> [92,]      17
> [93,]      17
> [94,]      17
> [95,]      17
> [96,]      17
> [97,]      17
> [98,]      17
> [99,]      17
> [100,]      17
> 
> 
> So, clearly, this is wrong.  I just would like to find the starting
> position of the first digit, no matter what it is.
> 
> It's probably easy, isn't it?
> 
> Best,
> 
> Jen