[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Bert Gunter
bgunter.4567 at gmail.com
Sat Jul 11 16:36:16 CEST 2015
Note that John's solution probably includes incorrect partial matches
and that mine fails to match "red" in "this is red." (splitting on
whitespace leaves the trailing period attached, giving "red."). If you
change my proposal to

sapply(strsplit(do.call(paste, zz[, 2:3]), "\\W"),
       function(x) any(x %in% alarm.words))

it should agree with Jeff's. Note, however, that all of these are
case-sensitive, so capital letters are missed: "Red" would not match
"This is red".
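
As a rough sketch of the capital-letter point, assuming a small made-up
data frame (the rows below and the names txt, by.split, by.regex and
by.substr are only illustrative, not from Chris's data), one could fold
case before matching and compare the three styles of match discussed in
this thread:

```r
# Toy data, invented for illustration only
zz <- data.frame(v2 = c("this is red.", "barred owl", "This is Red", "blue bird"),
                 v3 = c("harry potter", "ginny weasley", "dudley dursley",
                        "gandalf the grey"),
                 stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One string per row, pasted together from the columns to be searched
txt <- do.call(paste, zz[, 1:2])

# Whole words via strsplit: lower-case first, split on runs of non-word chars
by.split <- sapply(strsplit(tolower(txt), "\\W+"),
                   function(x) any(x %in% alarm.words))

# Whole words via a regular expression with word boundaries, ignoring case
pattern  <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")
by.regex <- grepl(pattern, txt, ignore.case = TRUE)

# Plain substring matching, for comparison: this one also flags "barred"
by.substr <- grepl(paste0(alarm.words, collapse = "|"), txt, ignore.case = TRUE)

print(data.frame(txt, by.split, by.regex, by.substr))
```

On this toy input the two whole-word versions agree with each other, and
only the substring version picks up "barred". Whether that partial match
is wanted depends on the application.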
Bert Gunter
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
-- Clifford Stoll
On Fri, Jul 10, 2015 at 10:54 AM, Christopher W Ryan
<cryan at binghamton.edu> wrote:
> Indeed, the perils of syndromic surveillance with free text.
>
>> with(dd.2, table(fox))
> fox
> FALSE TRUE
> 74939 1201
>
>> with(dd.2, table(gunter))
> gunter
> FALSE TRUE
> 75213 927
>
>> with(dd.2, table(newmiller))
> newmiller
> FALSE TRUE
> 75028 1112
>
>
> Of course, the simplest thing for me to do would be to add "heroine" to
> the alarm.words. I'm surprised that the US national organization that
> promulgated this list of drug-related terms did not include it. Many
> other common misspellings are included. I will have to contact them.
>
> --Chris
>
> On Fri, Jul 10, 2015 at 1:39 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> Yes. This is one of the fundamental challenges in text searching --
>> defining exactly what text defines a match and what doesn't. So,
>> continuing your example, one might imagine that heroin and heroine
>> might both be matches, but maybe heroines shouldn't be (e.g. if the
>> text contains movie reviews). So what one might want to do is add
>> semantic analysis to searches, à la google, a topic way beyond the
>> simple capabilities discussed, or likely needed, here.
>>
>> Incidentally, Jeff Newmiller's (final) regular expression solution is
>> preferable to mine in all respects, I think.
>>
>> -- Bert
>>
>>
>> On Fri, Jul 10, 2015 at 10:30 AM, Christopher W Ryan
>> <cryan at binghamton.edu> wrote:
>>> Interesting thoughts about the partial-word matches, and speed. On
>>> another real data set, about 73,000 records and 6 columns to search
>>> through for matches (one column of which contains very long character
>>> strings--several paragraphs each), I ran both John's and Bert's
>>> solutions. John's was noticeably slower, although still quite
>>> tolerable. There were a different number of matches, though:
>>>
>>>        oic.2
>>> oic     FALSE  TRUE   Sum
>>>   FALSE 74939     0 74939
>>>   TRUE    274   927  1201
>>>   Sum   75213   927 76140
>>>
>>> where oic is the logical vector generated by John's solution, and
>>> oic.2 is the logical vector generated by Bert's solution. Bert's
>>> solution detected about 77% of the cases detected by John's.
>>>
>>> I'm still exploring why that might be. One possible explanation, for
>>> at least part of the difference, is the issue of partial-word matches.
>>> Substantively, I am searching ambulance run records for words related
>>> to opioid overdose, and I've noticed that the medics often spell
>>> heroin as "heroine". So in this context, I like partial-word
>>> matches--I want to pick up records that (partially) match "heroin"
>>> because it is contained in the word "heroine".
>>>
>>> There may be other things going on too.
>>>
>>> Thanks.
>>>
>>> --Chris
>>>
>>> On Thu, Jul 9, 2015 at 3:24 PM, John Fox <jfox at mcmaster.ca> wrote:
>>>> Dear Christopher,
>>>>
>>>> My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time.
>>>>
>>>> That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. Both Bert's and Jeff's (eventual) solution protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that).
>>>>
>>>> Best,
>>>> John
>>>>
>>>>> -----Original Message-----
>>>>> From: Christopher W Ryan [mailto:cryan at binghamton.edu]
>>>>> Sent: July-09-15 2:49 PM
>>>>> To: Bert Gunter
>>>>> Cc: Jeff Newmiller; R Help; John Fox
>>>>> Subject: Re: [R] detecting any element in a vector of strings, appearing
>>>>> anywhere in any of several character variables in a dataframe
>>>>>
>>>>> Thanks everyone. John's original solution worked great. And with
>>>>> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
>>>>> about 15 seconds. That is certainly adequate for my needs. But I
>>>>> will try out the other strategies too.
>>>>>
>>>>> And thanks also for lots of new R things to learn--grep, grepl,
>>>>> do.call . . . that's always a bonus!
>>>>>
>>>>> --Chris Ryan
>>>>>
>>>>> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>> wrote:
>>>>> > Yup, that does it. Let grep figure out what's a word rather than doing
>>>>> > it manually. Forgot about "\b"
>>>>> >
>>>>> > Cheers,
>>>>> > Bert
>>>>> >
>>>>> >
>>>>> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
>>>>> > <jdnewmil at dcn.davis.ca.us> wrote:
>>>>> >> Just add a word break marker before and after:
>>>>> >>
>>>>> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ),
>>>>> >>                 do.call( paste, zz[ , 2:3 ] ) )
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>> >> ---------------------------------------------------------------------------
>>>>> >> Sent from my phone. Please excuse my brevity.
>>>>> >>
>>>>> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>> >>>Jeff:
>>>>> >>>
>>>>> >>>Well, it would be much better (no loops!) except, I think, for one
>>>>> >>>issue: "red" would match "barred" and I don't think that this is what
>>>>> >>>is wanted: the matches should be on whole "words" not just string
>>>>> >>>patterns.
>>>>> >>>
>>>>> >>>So you would need to fix up the matching pattern to make this work,
>>>>> >>>but it may be a little tricky, as arbitrary whitespace characters,
>>>>> >>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>>>>> >>>the words or ending the "sentence." I'm sure it can be done, but I'll
>>>>> >>>leave it to you or others to figure it out.
>>>>> >>>
>>>>> >>>Of course, if my diagnosis is wrong or silly, please point this out.
>>>>> >>>
>>>>> >>>Cheers,
>>>>> >>>Bert
>>>>> >>>
>>>>> >>>
>>>>> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>>>>> >>><jdnewmil at dcn.davis.ca.us> wrote:
>>>>> >>>> I think grep is better suited to this:
>>>>> >>>>
>>>>> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ),
>>>>> >>>>                 do.call( paste, zz[ , 2:3 ] ) )
>>>>> >>>>
>>>>> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>> >>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>>> >>>>>single, not a double, loop. It should be more efficient.
>>>>> >>>>>
>>>>> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>> >>>>>+ function(x)any(x %in% alarm.words))
>>>>> >>>>>
>>>>> >>>>> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
>>>>> >>>>>
>>>>> >>>>>The idea is to paste the strings in each row (do.call allows an
>>>>> >>>>>arbitrary number of columns) into a single string and then use
>>>>> >>>>>strsplit to break the string into individual "words" on whitespace.
>>>>> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
>>>>> >>>>>
>>>>> >>>>>Cheers,
>>>>> >>>>>Bert
>>>>> >>>>>
>>>>> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> >>>>>> Dear Chris,
>>>>> >>>>>>
>>>>> >>>>>> If I understand correctly what you want, how about the following?
>>>>> >>>>>>
>>>>> >>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> >>>>>>> zz[rows, ]
>>>>> >>>>>>
>>>>> >>>>>>           v1                              v2                v3 v4
>>>>> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>> >>>>>>
>>>>> >>>>>> I hope this helps,
>>>>> >>>>>> John
>>>>> >>>>>>
>>>>> >>>>>> ------------------------------------------------
>>>>> >>>>>> John Fox, Professor
>>>>> >>>>>> McMaster University
>>>>> >>>>>> Hamilton, Ontario, Canada
>>>>> >>>>>> http://socserv.mcmaster.ca/jfox/
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>> >>>>>> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>> >>>>>>> Running R 3.1.1 on windows 7
>>>>> >>>>>>>
>>>>> >>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>> >>>>>>> of several keywords in any of several variables.
>>>>> >>>>>>>
>>>>> >>>>>>> Example:
>>>>> >>>>>>>
>>>>> >>>>>>> # create a dataframe with 4 variables and 10 records
>>>>> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>>>>> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
>>>>> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>> >>>>>>>                  stringsAsFactors=FALSE)
>>>>> >>>>>>> str(zz)
>>>>> >>>>>>> zz
>>>>> >>>>>>>
>>>>> >>>>>>> # here are the keywords
>>>>> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>> >>>>>>>
>>>>> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>> >>>>>>> # set zz$v5=TRUE for that record.
>>>>> >>>>>>>
>>>>> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>> >>>>>>> # obviously misunderstand something about how str_detect works
>>>>> >>>>>>>
>>>>> >>>>>>> library(stringr)
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>> >>>>>>>                                      # must be a vector, not multiple
>>>>> >>>>>>>                                      # columns
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>> >>>>>>>                                      # is less than the number of
>>>>> >>>>>>>                                      # rows I am using for the
>>>>> >>>>>>>                                      # comparison
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>> >>>>>>> length(alarm.words)                  # confining nrows
>>>>> >>>>>>>                                      # to the length of alarm.words
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>> >>>>>>>
>>>>> >>>>>>> # maybe I need apply() ?
>>>>> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz[,2], 1, my.f) # again, a mismatch in lengths
>>>>> >>>>>>> # between alarm.words and that
>>>>> >>>>>>> # in which I am searching for
>>>>> >>>>>>> # matching strings
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz, 2, my.f) # now I'm getting somewhere
>>>>> >>>>>>> apply(zz[1:4,], 2, my.f) # but still only works with 4
>>>>> >>>>>>> # rows of the dataframe
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> # perhaps %in% could do the job?
>>>>> >>>>>>>
>>>>> >>>>>>> Appreciate any advice.
>>>>> >>>>>>>
>>>>> >>>>>>> --Chris Ryan
>>>>> >>>>>>>
>>>>> >>>>>>> ______________________________________________
>>>>> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.