[R] Need to compare two columns in two data.frames and return all rows from df where rows values are missing
Jake Elmstedt
j@ke@e|m@tedt @end|ng |rom gm@||@com
Tue Jun 15 21:07:29 CEST 2021
Bert,
There are significant disadvantages to your approach. This,
emphatically, is NOT the proper use case of the merge() function.
set.seed(123)
df1 <- data.frame(name = sample(letters[1:4], 12, TRUE), score = runif(12))
df2 <- data.frame(name = sample(letters[3:6], 12, TRUE), score = runif(12))
out <- merge(df1, df2, by = "name", all = TRUE)
(out <- merge(df1, df2, by = "name", all = TRUE))
#> name score.x score.y
#> 1 a 0.69280341 NA
#> 2 b 0.04205953 NA
#> 3 b 0.32792072 NA
#> 4 b 0.89982497 NA
#> 5 b 0.95450365 NA
#> 6 b 0.99426978 NA
#> 7 c 0.67757064 0.4659625
#> 8 c 0.67757064 0.4137243
#> 9 c 0.67757064 0.7584595
#> 10 c 0.24608773 0.4659625
#> 11 c 0.24608773 0.4137243
#> 12 c 0.24608773 0.7584595
#> 13 c 0.10292468 0.4659625
#> 14 c 0.10292468 0.4137243
#> 15 c 0.10292468 0.7584595
#> 16 c 0.88953932 0.4659625
#> 17 c 0.88953932 0.4137243
#> 18 c 0.88953932 0.7584595
#> 19 c 0.57263340 0.4659625
#> 20 c 0.57263340 0.4137243
#> 21 c 0.57263340 0.7584595
#> 22 d 0.64050681 0.2316258
#> 23 d 0.64050681 0.4145463
#> 24 d 0.64050681 0.2330341
#> 25 d 0.64050681 0.3688455
#> 26 e NA 0.1428000
#> 27 e NA 0.2164079
#> 28 e NA 0.1524447
#> 29 f NA 0.3181810
#> 30 f NA 0.1388061
After doing this merge (assuming there are no edge cases which I'll
discuss at the end) you'd still need to,
Drop the incomplete cases, make a new column of the non-NA values,
and drop the two merged columns,
out2 <- out[!complete.cases(out), ]
(out2[["score"]] <- with(out2, ifelse(is.na(score.x), score.y, score.x)))
#> [1] 0.69280341 0.04205953 0.32792072 0.89982497 0.95450365 0.99426978
#> [7] 0.14280002 0.21640794 0.15244475 0.31818101 0.13880606
(result <- out2[, c(1L, 4L)])
#> name score
#> 1 a 0.69280341
#> 2 b 0.04205953
#> 3 b 0.32792072
#> 4 b 0.89982497
#> 5 b 0.95450365
#> 6 b 0.99426978
#> 26 e 0.14280002
#> 27 e 0.21640794
#> 28 e 0.15244475
#> 29 f 0.31818101
#> 30 f 0.13880606
Now, let's talk about one possible edge case. If either original
data.frame object contains an NA value, that row could be kept
inappropriately, as you're assuming anything with an NA should be
kept.
In short, merge() is inappropriate to use here. You can force it to
(somewhat) work, assuming no complications, but it really is a poor
choice for this task.
Created on 2021-06-15 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.0.9000)
On Mon, Jun 14, 2021 at 7:50 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>
> merge(..., all = TRUE) essentially does this for you. rows with NA's are the non-matches:
>
> merge(df1,df2,by = "name",all = TRUE)
> name score.x score.y
> 1 a 0.69280341 NA
> 2 b 0.04205953 NA
> 3 b 0.32792072 NA
> 4 b 0.89982497 NA
> 5 b 0.95450365 NA
> 6 b 0.99426978 NA
> 7 c 0.67757064 0.4659625
> 8 c 0.67757064 0.4137243
> 9 c 0.67757064 0.7584595
> 10 c 0.24608773 0.4659625
> 11 c 0.24608773 0.4137243
> 12 c 0.24608773 0.7584595
> 13 c 0.10292468 0.4659625
> 14 c 0.10292468 0.4137243
> 15 c 0.10292468 0.7584595
> 16 c 0.88953932 0.4659625
> 17 c 0.88953932 0.4137243
> 18 c 0.88953932 0.7584595
> 19 c 0.57263340 0.4659625
> 20 c 0.57263340 0.4137243
> 21 c 0.57263340 0.7584595
> 22 d 0.64050681 0.2316258
> 23 d 0.64050681 0.4145463
> 24 d 0.64050681 0.2330341
> 25 d 0.64050681 0.3688455
> 26 e NA 0.1428000
> 27 e NA 0.2164079
> 28 e NA 0.1524447
> 29 f NA 0.3181810
> 30 f NA 0.1388061
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Jun 14, 2021 at 7:09 PM Jake Elmstedt <jake.elmstedt using gmail.com> wrote:
>>
>> set.seed(123)
>>
>> df1 <- data.frame(name = sample(letters[1:4], 12, TRUE), score = runif(12))
>> head(df1)
>> #> name score
>> #> 1 c 0.67757064
>> #> 2 c 0.57263340
>> #> 3 c 0.10292468
>> #> 4 b 0.89982497
>> #> 5 c 0.24608773
>> #> 6 b 0.04205953
>> table(df1[["name"]])
>> #>
>> #> a b c d
>> #> 1 5 5 1
>>
>> df2 <- data.frame(name = sample(letters[3:6], 12, TRUE), score = runif(12))
>> head(df2)
>> #> name score
>> #> 1 c 0.7584595
>> #> 2 e 0.2164079
>> #> 3 f 0.3181810
>> #> 4 d 0.2316258
>> #> 5 e 0.1428000
>> #> 6 d 0.4145463
>> table(df2[["name"]])
>> #>
>> #> c d e f
>> #> 3 4 3 2
>>
>> df3 <- rbind(df1[!df1[["name"]] %in% df2[["name"]], ],
>> df2[!df2[["name"]] %in% df1[["name"]], ])
>> head(df3)
>> #> name score
>> #> 4 b 0.89982497
>> #> 6 b 0.04205953
>> #> 7 b 0.32792072
>> #> 8 b 0.95450365
>> #> 10 a 0.69280341
>> #> 12 b 0.99426978
>> table(df3[["name"]])
>> #>
>> #> a b e f
>> #> 1 5 3 2
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list