[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
Rui Barradas
Fri Jun 11 20:03:49 CEST 2021
Hello,
From what I understand of the problem, this might be what you want.
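library(dplyr)
library(stringr)

# Wrap each core word in word boundaries so that, e.g., "a" does not
# match inside "small", then combine everything into one alternation.
coreWordsPat <- paste0("\\b", coreWords, "\\b")
coreWordsPat <- paste(coreWordsPat, collapse = "|")

left_join(
  # Core: 1 if the utterance contains at least one core word
  df %>%
    mutate(Core = +str_detect(Utterance, coreWordsPat)) %>%
    select(ID, Utterance, Core),
  # Fringe: 1 if anything is left after removing all core words
  df %>%
    mutate(Fringe = str_remove_all(Utterance, coreWordsPat),
           Fringe = +(nchar(trimws(Fringe)) > 0)) %>%
    select(ID, Fringe),
  by = "ID"
)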
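To see why the word boundaries matter, here is a minimal base R sketch on four of the utterances from the post. The core list is shortened, and the names `utt`, `loosePat`, `strictPat`, `looseCore`, `strictCore`, `leftover` and `fringe` are made up for this illustration only. Without `\\b`, short core words such as "a" and "is" match inside longer words, which appears to be why "small" and "where's his bed" came out as core.

```r
# Illustration only: a shortened core list and four utterances from the post.
coreWords <- c("a", "is", "here", "yes", "all done")
utt <- c("a baby", "small", "yes", "where's his bed")

# Unanchored alternation: matches core words anywhere, even inside other words.
loosePat <- paste(coreWords, collapse = "|")
# Anchored alternation: each core word must sit between word boundaries.
strictPat <- paste0("\\b", coreWords, "\\b", collapse = "|")

looseCore  <- +grepl(loosePat, utt, perl = TRUE)
strictCore <- +grepl(strictPat, utt, perl = TRUE)

# Fringe: whatever is left after deleting the core words is a non-core word.
leftover <- trimws(gsub(strictPat, "", utt, perl = TRUE))
fringe <- +(nchar(leftover) > 0)

print(data.frame(utt, looseCore, strictCore, fringe))
```

With the unanchored pattern all four utterances are flagged as core. With `\\b`, only "a baby" and "yes" are, and every utterance except "yes" is flagged as fringe.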
Hope this helps,
Rui Barradas
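P.S. One caveat with the pattern above: `\\b` treats an apostrophe as a word boundary, so "you're" is counted as containing the core word "you". If contractions should not count as core, a possible alternative is to split each utterance on whitespace and compare whole tokens. The sketch below assumes that is what is wanted; `classifyUtterance` is a hypothetical helper, the core list is shortened, and multi-word entries such as "all done" would need separate handling because they never match a single token.

```r
# Hypothetical helper, not from the original post: classify one utterance
# by comparing whole whitespace-separated tokens against the core list.
classifyUtterance <- function(utterance, coreWords) {
  tokens <- strsplit(trimws(utterance), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]       # drop empty tokens, if any
  isCore <- tokens %in% coreWords        # exact, case-sensitive match
  c(Core = +any(isCore), Fringe = +any(!isCore))
}

# Shortened core list, for illustration only.
coreWords <- c("you", "what", "is", "that")

res <- t(sapply(c("what is that ", "you're welcome "),
                classifyUtterance, coreWords = coreWords))
print(res)
```

Here "what is that " gets Core = 1 and Fringe = 0, while "you're welcome " gets Core = 0 and Fringe = 1, because "you're" is compared as a whole token.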
At 18:02 on 11/06/21, Debbie Hahs-Vaughn wrote:
> I am working with utterances, statements spoken by children. From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').
>
> If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe'). I will not have a list of Fringe words.
>
> Basically, right now I have a child ID and only the utterances. Here is a snippet of my data.
>
> ID Utterance
> 1 a baby
> 2 small
> 3 yes
> 4 where's his bed
> 5 there's his bed
> 6 where's his pillow
> 7 what is that on his head
> 8 hey he has his arm stuck here
> 9 there there's it
> 10 now you're gonna go night-night
> 11 and that's the thing you can turn on
> 12 yeah where's the music box
> 13 what is this
> 14 small
> 15 there you go baby
>
>
> The following code runs but isn't doing exactly what I need--which is: 1) the ability to detect words from the list and define as core; 2) the ability to search the utterance and if there are any words in the utterance that are NOT core, to identify those as '1' as I will not have a list of fringe words.
>
> ```
>
> library(dplyr)
> library(stringr)
> library(tidyr)
>
> coreWords <-c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", "go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", "help", "all done", "finished")
>
> str_detect(df,)
>
> dfplus <- df %>%
>   mutate(id = row_number()) %>%
>   separate_rows(Utterance, sep = ' ') %>%
>   mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
>          Fringe = + !Core) %>%
>   group_by(id) %>%
>   mutate(Core = + (sum(Core) > 0),
>          Fringe = + (sum(Fringe) > 0)) %>%
>   slice(1) %>%
>   select(-Utterance) %>%
>   left_join(df) %>%
>   ungroup() %>%
>   select(Utterance, Core, Fringe, ID)
>
> ```
>
> The dput() code is:
>
> ```
> structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
> "there's his bed", "where's his pillow", "what is that on his head",
> "hey he has his arm stuck here", "there there's it", "now you're gonna go night-night",
> "and that's the thing you can turn on", "yeah where's the music box",
> "what is this", "small", "there you go baby ", "what is this for ",
> "a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
> "what does she say ", "what she say", "should I turn the on so Laura doesn't cry ",
> "what is this ", "what is that ", "where's clothes ", " where's the baby's bedroom ",
> "that might be in dad's bed+room ", "yes ", "there you go baby ",
> "you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
> 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
> -31L), class = c("tbl_df", "tbl", "data.frame"))
>
> ```
>
> The first 10 rows of output look like this:
>
>    Utterance                        Core Fringe ID
> 1  a baby                              1      0  1
> 2  small                               1      0  2
> 3  yes                                 1      0  3
> 4  where's his bed                     1      1  4
> 5  there's his bed                     1      1  5
> 6  where's his pillow                  1      1  6
> 7  what is that on his head            1      0  7
> 8  hey he has his arm stuck here       1      1  8
> 9  there there's it                    1      0  9
> 10 now you're gonna go night-night     1      1 10
>
> For example, in line 1 of the output, 'a' is a core word so '1' for core is correct. However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.
>
> Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core.
>
> Any suggestions on what is happening and how to correct it are greatly appreciated.