[R] Proper use of grep

Thu Jul 15 23:36:02 CEST 2010

Doran, Harold wrote:
> I just need to confirm something with pattern matching folks. I have
> a factor with the following levels in a very large data set:
> 
>> levels(all$Classical.Statistic)
>  [1] ""                        "AB;ABD"                  "CollapsedSteps"          "CR_P"                    "CR_Prop;CR_P;AB"
>  [6] "NMK"                     "NMK;P"                   "NMK;P;ABD"               "P"                       "ABD"
> [11] "CR_P;CollapsedSteps"     "NMK;AB;ABD"              "NMK;ABD"                 "NMK;P;AB"                "NMK;P;AB;ABD"
> [16] "AB"                      "CRT;CollapsedSteps"      "NMK;AB"                  "CR_P;CRT;CollapsedSteps" "CR_Prop;CR_P"
> 
> I need to subset the rows in which the term "CollapsedSteps" appears.
> So, it may appear as "CollapsedSteps" or may appear as
> "CR_P;CRT;CollapsedSteps" as you can see above. I'm using grep as
> follows:
> 
> all[grep('CollapsedSteps', all$Classical.Statistic),]
> 
> to find any row in which the term "CollapsedSteps" appears. Is this
> certain to catch all cases, or is there an intricacy that I may have
> missed?

Well, just try it for yourself on a data.frame that's small enough to 
verify 'manually'.  For instance, the data.frame that contains each 
level exactly once sounds like a good candidate.

test <- subset(all, !duplicated(Classical.Statistic))

and then try your line of code ...
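
If you want something self-contained to play with, here is a minimal
sketch.  The levels are copied from your post; the data.frame is built
by hand as a stand-in for 'all', and the names 'lv', 'hits', 'terms'
and 'exact' are only made up for the illustration.

```r
# Stand-in for the real data: one row per level, copied from the post.
lv <- c("", "AB;ABD", "CollapsedSteps", "CR_P", "CR_Prop;CR_P;AB",
        "NMK", "NMK;P", "NMK;P;ABD", "P", "ABD",
        "CR_P;CollapsedSteps", "NMK;AB;ABD", "NMK;ABD", "NMK;P;AB",
        "NMK;P;AB;ABD", "AB", "CRT;CollapsedSteps", "NMK;AB",
        "CR_P;CRT;CollapsedSteps", "CR_Prop;CR_P")
test <- data.frame(Classical.Statistic = factor(lv, levels = lv))

# The line from the post, applied to the small data.frame.
hits <- test[grep("CollapsedSteps", test$Classical.Statistic), , drop = FALSE]
as.character(hits$Classical.Statistic)

# Cross-check: split each value on ";" and look for an exact term match.
terms <- strsplit(as.character(test$Classical.Statistic), ";", fixed = TRUE)
exact <- vapply(terms, function(x) "CollapsedSteps" %in% x, logical(1))
identical(which(exact), grep("CollapsedSteps", test$Classical.Statistic))
```

Keep in mind that grep() matches substrings, so a level that merely
contains "CollapsedSteps" as part of a longer term would be picked up
as well.  None of the levels you list looks like that, which is why the
cross-check on whole terms should agree with grep() here.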

And do you really want "" as a level, or should those be NA?
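
If NA is what you want, one possible way to get there is sketched
below.  The factor 'x' is a made-up toy example, not your data, and I
am assuming that an empty string really means "missing" in your case.

```r
# Toy factor with "" used for missing entries (hypothetical data).
x <- factor(c("", "P", "CollapsedSteps", ""))

# Mark the empty strings as missing ...
is.na(x) <- which(x == "")

# ... and rebuild the factor so the now unused level "" is dropped.
x <- factor(x)

levels(x)
is.na(x)
```

grep() does not match NA, so the subsetting line from your post should
behave the same way for those rows as it does with "".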