[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Bert Gunter bgunter.4567 at gmail.com
Sat Jul 11 16:36:16 CEST 2015


Note that John's solution probably includes incorrect partial matches
and that mine fails to match "red" in "this is red." If you change my
proposal to

 sapply(strsplit(do.call(paste,zz[,2:3]),"\\W"), function(x)any(x %in%
alarm.words))

it should agree with Jeff's. Note, however, that you have missed
capital letters:  "Red" would not match "This is red".
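
A minimal sketch of the capitalization issue, assuming the toy data and alarm.words from Chris's original post (v1 and v4 are omitted, and "Red sparks" is capitalized here purely for illustration). Lower-casing the text before splitting, or passing ignore.case = TRUE to grepl(), should let both approaches catch "Red":

```r
# Toy data from the original post; "Red sparks" is capitalized here
# (an assumption for illustration) to show the case problem.
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "Red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v2 = v2, v3 = v3, stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One string per row, built from all the text columns
txt <- do.call(paste, zz)

# Split on non-word characters, case-sensitive: misses "Red" in row 6
hits.cs <- sapply(strsplit(txt, "\\W+"),
                  function(x) any(x %in% alarm.words))

# Same split, but lower-case the text first
hits.split <- sapply(strsplit(tolower(txt), "\\W+"),
                     function(x) any(x %in% alarm.words))

# Whole-word regular expression, ignoring case
pat <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")
hits.grepl <- grepl(pat, txt, ignore.case = TRUE)

which(hits.cs)      # rows 3, 9, 10
which(hits.split)   # rows 3, 6, 9, 10
identical(hits.split, hits.grepl)
```

Lower-casing only the text works here because alarm.words is already lower case; if it were not, it would need the same treatment.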


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Fri, Jul 10, 2015 at 10:54 AM, Christopher W Ryan
<cryan at binghamton.edu> wrote:
> Indeed, the perils of syndromic surveillance with free text.
>
>> with(dd.2, table(fox))
> fox
> FALSE  TRUE
> 74939  1201
>
>> with(dd.2, table(gunter))
> gunter
> FALSE  TRUE
> 75213   927
>
>> with(dd.2, table(newmiller))
> newmiller
> FALSE  TRUE
> 75028  1112
>
>
> Of course, the simplest thing for me to do would be add "heroine" to
> the alarm.words.  I'm surprised that the US national organization that
> promulgated this list of drug-related terms did not include it. Many
> other common misspellings are included.  I will have to contact them.
>
> --Chris
>
> On Fri, Jul 10, 2015 at 1:39 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> Yes. This is one of the fundamental challenges in text searching --
>> defining exactly what text defines a match and what doesn't. So,
>> continuing your example, one might imagine that heroin and heroine
>> might both be matches, but maybe heroines shouldn't be (e.g. if the
>> text contains movie reviews). So what one might want to do is add
>> semantic analysis to searches, à la google, a topic way beyond the
>> simple capabilities discussed, or likely needed, here.
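
As a rough illustration of how such a choice might be encoded without any semantic analysis, the sketch below accepts "heroin" and "heroine" as whole words but not "heroines". The notes vector is invented for this example and is not taken from the real data:

```r
# Hypothetical free-text entries, invented for illustration
notes <- c("pt admits heroin use", "possible heroine overdose",
           "review praises the film's heroines", "no drugs found")

# Whole-word match on "heroin" with an optional trailing "e"
pat <- "\\bheroine?\\b"
hits <- grepl(pat, notes, ignore.case = TRUE)
hits   # TRUE TRUE FALSE FALSE
```

Each such exception has to be written into the pattern by hand, which is why this only goes so far.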
>>
>> Incidentally, Jeff Newmiller's (final) regular expression solution is
>> preferable to mine in all respects, I think.
>>
>> -- Bert
>>
>> On Fri, Jul 10, 2015 at 10:30 AM, Christopher W Ryan
>> <cryan at binghamton.edu> wrote:
>>> Interesting thoughts about the partial-word matches, and speed. On
>>> another real data set, about 73,000 records and 6 columns to search
>>> through for matches (one column of which contains very long character
>>> strings--several paragraphs each), I ran both John's and Bert's
>>> solutions.  John's was noticeably slower, although still quite
>>> tolerable.  There were a different number of matches, though:
>>>
>>>        oic.2
>>> oic     FALSE  TRUE   Sum
>>>   FALSE 74939     0 74939
>>>   TRUE    274   927  1201
>>>   Sum   75213   927 76140
>>>
>>> where oic is the logical vector generated by John's solution, and
>>> oic.2 is the logical vector generated by Bert's solution. Bert's
>>> solution detected about 77% of the cases detected by John's.
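
One way to start exploring such a discrepancy is to pull out the rows that only one method flags and read them. A minimal sketch, using short made-up vectors as stand-ins for the real oic and oic.2:

```r
# Hypothetical stand-ins for the two logical vectors
oic   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
oic.2 <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Cross-tabulate with margins, in the same layout as the table above
addmargins(table(oic, oic.2))

# Rows flagged by the first method only; these are the ones to inspect
only.first <- which(oic & !oic.2)
only.first

# Share of the first method's hits that the second also finds
agreement <- mean(oic.2[oic])
agreement
```

On the real data, indexing the data frame with only.first would show the text behind each disagreement.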
>>>
>>> I'm still exploring why that might be. One possible explanation, for
>>> at least part of the difference, is the issue of partial-word matches.
>>> Substantively, I am searching ambulance run records for words related
>>> to opioid overdose, and I've noticed that the medics often spell
>>> heroin as "heroine". So in this context, I like partial-word
>>> matches--I want to pick up records that (partially) match "heroin"
>>> because it is contained in the word "heroine".
>>>
>>> There may be other things going on too.
>>>
>>> Thanks.
>>>
>>> --Chris
>>>
>>> On Thu, Jul 9, 2015 at 3:24 PM, John Fox <jfox at mcmaster.ca> wrote:
>>>> Dear Christopher,
>>>>
>>>> My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time.
>>>>
>>>> That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. Both Bert's and Jeff's (eventual) solution protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that).
>>>>
>>>> Best,
>>>>  John
>>>>
>>>>> -----Original Message-----
>>>>> From: Christopher W Ryan [mailto:cryan at binghamton.edu]
>>>>> Sent: July-09-15 2:49 PM
>>>>> To: Bert Gunter
>>>>> Cc: Jeff Newmiller; R Help; John Fox
>>>>> Subject: Re: [R] detecting any element in a vector of strings, appearing
>>>>> anywhere in any of several character variables in a dataframe
>>>>>
>>>>> Thanks everyone.  John's original solution worked great.  And with
>>>>> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
>>>>> about 15 seconds.  That is certainly adequate for my needs.  But I
>>>>> will try out the other strategies too.
>>>>>
>>>>> And thanks also for lots of new R things to learn--grep, grepl,
>>>>> do.call . . . that's always a bonus!
>>>>>
>>>>> --Chris Ryan
>>>>>
>>>>> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>> wrote:
>>>>> > Yup, that does it. Let grep figure out what's a word rather than doing
>>>>> > it manually. Forgot about "\b"
>>>>> >
>>>>> > Cheers,
>>>>> > Bert
>>>>> >
>>>>> >
>>>>> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
>>>>> > <jdnewmil at dcn.davis.ca.us> wrote:
>>>>> >> Just add a word break marker before and after:
>>>>> >>
>>>>> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ),
>>>>> >> ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Sent from my phone. Please excuse my brevity.
>>>>> >>
>>>>> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com>
>>>>> wrote:
>>>>> >>>Jeff:
>>>>> >>>
>>>>> >>>Well, it would be much better (no loops!) except, I think, for one
>>>>> >>>issue: "red" would match "barred" and I don't think that this is what
>>>>> >>>is wanted: the matches should be on whole "words" not just string
>>>>> >>>patterns.
>>>>> >>>
>>>>> >>>So you would need to fix up the matching pattern to make this work,
>>>>> >>>but it may be a little tricky, as arbitrary whitespace characters,
>>>>> >>>e.g. " " or "\n" etc. could be in the strings to be matched, separating
>>>>> >>>the words or ending the "sentence."  I'm sure it can be done, but I'll
>>>>> >>>leave it to you or others to figure it out.
>>>>> >>>
>>>>> >>>Of course, if my diagnosis is wrong or silly, please point this out.
>>>>> >>>
>>>>> >>>Cheers,
>>>>> >>>Bert
>>>>> >>>
>>>>> >>>
>>>>> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>>>>> >>><jdnewmil at dcn.davis.ca.us> wrote:
>>>>> >>>> I think grep is better suited to this:
>>>>> >>>>
>>>>> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ),
>>>>> >>>>                 do.call( paste, zz[ , 2:3 ] ) )
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter
>>>>> <bgunter.4567 at gmail.com>
>>>>> >>>wrote:
>>>>> >>>>>Here's a way to do it that uses %in% (i.e. match()) and uses only a
>>>>> >>>>>single, not a double, loop. It should be more efficient.
>>>>> >>>>>
>>>>> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>> >>>>>+       function(x)any(x %in% alarm.words))
>>>>> >>>>>
>>>>> >>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>> >>>>>
>>>>> >>>>>The idea is to paste the strings in each row (do.call allows an
>>>>> >>>>>arbitrary number of columns) into a single string and then use
>>>>> >>>>>strsplit to break the string into individual "words" on whitespace.
>>>>> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
>>>>> >>>>>
>>>>> >>>>>Cheers,
>>>>> >>>>>Bert
>>>>> >>>>>
>>>>> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> >>>>>> Dear Chris,
>>>>> >>>>>>
>>>>> >>>>>> If I understand correctly what you want, how about the following?
>>>>> >>>>>>
>>>>> >>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> >>>>>>> zz[rows, ]
>>>>> >>>>>>
>>>>> >>>>>>           v1                              v2                v3 v4
>>>>> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>> >>>>>>
>>>>> >>>>>> I hope this helps,
>>>>> >>>>>>  John
>>>>> >>>>>>
>>>>> >>>>>> ------------------------------------------------
>>>>> >>>>>> John Fox, Professor
>>>>> >>>>>> McMaster University
>>>>> >>>>>> Hamilton, Ontario, Canada
>>>>> >>>>>> http://socserv.mcmaster.ca/jfox/
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>> >>>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>> >>>>>>> Running R 3.1.1 on windows 7
>>>>> >>>>>>>
>>>>> >>>>>>> I want to identify as a case any record in a dataframe that contains
>>>>> >>>>>>> any of several keywords in any of several variables.
>>>>> >>>>>>>
>>>>> >>>>>>> Example:
>>>>> >>>>>>>
>>>>> >>>>>>> # create a dataframe with 4 variables and 10 records
>>>>> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>>>>> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
>>>>> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>> >>>>>>>                  stringsAsFactors=FALSE)
>>>>> >>>>>>> str(zz)
>>>>> >>>>>>> zz
>>>>> >>>>>>>
>>>>> >>>>>>> # here are the keywords
>>>>> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>> >>>>>>>
>>>>> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then, if
>>>>> >>>>>>> # so, set zz$v5=TRUE for that record.
>>>>> >>>>>>>
>>>>> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>> >>>>>>> # obviously misunderstand something about how str_detect works
>>>>> >>>>>>>
>>>>> >>>>>>> library(stringr)
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>> >>>>>>>                                      # must be a vector, not multiple
>>>>> >>>>>>>                                      # columns
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>> >>>>>>>                                      # is less than the number of
>>>>> >>>>>>>                                      # rows I am using for the
>>>>> >>>>>>>                                      # comparison
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>> >>>>>>> length(alarm.words)                  # confining nrows
>>>>> >>>>>>>                                      # to the length of alarm.words
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>> >>>>>>>
>>>>> >>>>>>> # maybe I need apply() ?
>>>>> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>> >>>>>>>                            # between alarm.words and that
>>>>> >>>>>>>                            # in which I am searching for
>>>>> >>>>>>>                            # matching strings
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>> >>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>> >>>>>>>                            # rows of the dataframe
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> # perhaps %in% could do the job?
>>>>> >>>>>>>
>>>>> >>>>>>> Appreciate any advice.
>>>>> >>>>>>>
>>>>> >>>>>>> --Chris Ryan
>>>>> >>>>>>>
>>>>> >>>>>>> ______________________________________________
>>>>> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>> >>>>>>


