[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Bert Gunter
bgunter.4567 at gmail.com
Thu Jul 9 17:51:10 CEST 2015
Here's a way to do it that uses %in% (i.e., match()) and needs only a
single loop rather than a nested one. It should be more efficient.
> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
+ function(x)any(x %in% alarm.words))
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
The idea is to paste the strings in each row (do.call allows an
arbitrary number of columns) into a single string and then use
strsplit to break the string into individual "words" on whitespace.
Then the matching is vectorized with the any( %in% ... ) call.
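
The same steps can be spelled out one at a time. This is only a sketch: it rebuilds the example data from the original question, the intermediate names (pasted, words) are made up for illustration, and vapply() is swapped in for sapply() so the result is guaranteed to be a logical vector. The v5 column is the one the original poster asked for.

```r
# Example data from the original question
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v1 = rnorm(10), v2 = v2, v3 = v3,
                 v4 = rpois(10, lambda = 2), stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# Step 1: paste the text columns of each row into one string
# (hypothetical name 'pasted'; do.call() passes each column to paste())
pasted <- do.call(paste, zz[, 2:3])

# Step 2: split each pasted string into words on runs of whitespace
words <- strsplit(pasted, "[[:space:]]+")

# Step 3: flag a row if at least one of its words is an alarm word
zz$v5 <- vapply(words, function(w) any(w %in% alarm.words), logical(1))

# Row numbers of the flagged records
which(zz$v5)
```

One design point to keep in mind: %in% compares whole words, so "red" would not flag a word such as "reddish", whereas a grepl()-based search matches substrings and would. On this example data both give the same rows.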
Cheers,
Bert
Bert Gunter
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
-- Clifford Stoll
On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
> Dear Chris,
>
> If I understand correctly what you want, how about the following?
>
>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>> zz[rows, ]
>
>           v1                              v2                v3 v4
> 3  -1.022329                    green turtle    ronald weasley  2
> 6   0.336599              waffle the hamster        red sparks  1
> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
> 10  1.130622                      black bear  gandalf the grey  2
>
> I hope this helps,
> John
>
> ------------------------------------------------
> John Fox, Professor
> McMaster University
> Hamilton, Ontario, Canada
> http://socserv.mcmaster.ca/jfox/
>
>
> On Wed, 08 Jul 2015 22:23:37 -0400
> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>> Running R 3.1.1 on Windows 7
>>
>> I want to identify as a case any record in a dataframe that contains any
>> of several keywords in any of several variables.
>>
>> Example:
>>
>> # create a dataframe with 4 variables and 10 records
>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>         "hello world", "yellow giraffe with a long neck", "black bear")
>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>         "white dress robes", "gandalf the white", "gandalf the grey")
>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>                  stringsAsFactors=FALSE)
>> str(zz)
>> zz
>>
>> # here are the keywords
>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>
>> # For each row/record, I want to test whether the string in v2 or the
>> # string in v3 contains any of the strings in alarm.words. And then if so,
>> # set zz$v5=TRUE for that record.
>>
>> # I'm thinking the str_detect function in the stringr package ought to
>> # be able to help, perhaps with some use of apply over the rows, but I
>> # obviously misunderstand something about how str_detect works
>>
>> library(stringr)
>>
>> str_detect(zz[,2:3], alarm.words) # error: the target of the search
>> # must be a vector, not multiple
>> # columns
>>
>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>
>> str_detect(zz[,2], alarm.words) # error, length of alarm.words
>> # is less than the number of
>> # rows I am using for the
>> # comparison
>>
>> str_detect(zz[1:4,2], alarm.words) # works as hoped when
>> length(alarm.words) # confining nrows
>> # to the length of alarm.words
>>
>> str_detect(zz, alarm.words) # obviously not right
>>
>> # maybe I need apply() ?
>> my.f <- function(x){str_detect(x, alarm.words)}
>>
>> apply(zz[,2], 1, my.f) # again, a mismatch in lengths
>> # between alarm.words and that
>> # in which I am searching for
>> # matching strings
>>
>> apply(zz, 2, my.f) # now I'm getting somewhere
>> apply(zz[1:4,], 2, my.f) # but still only works with 4
>> # rows of the dataframe
>>
>>
>> # perhaps %in% could do the job?
>>
>> Appreciate any advice.
>>
>> --Chris Ryan
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.