[R] How to use compare.linkage in RecordLinkage package? -- more details but problem remains
Anders Alexandersson
andersalex at gmail.com
Thu Jan 28 21:01:52 CET 2016
How does one link two datasets using the compare.linkage function in the
RecordLinkage package? This follows up on my original posting from earlier
today:
https://stat.ethz.ch/pipermail/r-help/2016-January/435736.html
I suggested then that I should perhaps have added the identity arguments
(identity1 and identity2). But if I add them, I unexpectedly get 5 matches,
47885 non-matches and 0 pairs with unknown status. For example, the pair in
row 4256 is flagged as a match, which is unexpected because the matching
variable bm does not agree: bm is 0 in the comparison pattern, since bm is 1
for BERND JUNG and 4 for BERND MUELLER. Also, is_match in row 1 changes from
unknown (NA) to non-match (0), which is unexpected since the matching
variable bm agrees (bm=1 in the comparison pattern).
Here are the major new R commands that I ran and the output:
> rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld=c(1),
+   identity1=identity.RLdata500, identity2=identity.RLdata10000,
+   exclude=c(2:5,7))
> subset(rpairs$pairs, is_match=="1") # Why these 5 matches?
      id1  id2 fname_c1 bm is_match
4256   59 1394        1  0        1
5811  174 3684        1  0        1
14699 139 4199        1  0        1
16453  92 4580        1  0        1
21840  73  737        1  0        1
> RLdata500[c(17, 59), ] # first obs, and first matching obs
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
59     BERND     <NA>     JUNG    KLEIN 1935  1 14
> RLdata10000[c(343, 1394), ] # first obs, and first matching obs
      fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343  ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
1394     BERND     <NA>  MUELLER     <NA> 1942  4  4
> rpairs$pairs[1:2, ]; # list first 2 obs
  id1  id2 fname_c1 bm is_match
1  17  343        1  1        0
2  17 2385        1  0        0
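One possible reading, which I have not confirmed, is that is_match does not
come from the comparison pattern at all but records the known true status
derived from the identity vectors: a pair would be 1 when
identity1[id1] == identity2[id2] and 0 otherwise. Below is a minimal base R
sketch of that rule. The vectors identity1 and identity2 and the data frame
pairs are made-up toy values for illustration only, not taken from RLdata500
or RLdata10000.

```r
# Toy stand-ins (hypothetical values, not from the RecordLinkage data sets)
identity1 <- c(10, 20, 30)        # true identity of each record in dataset 1
identity2 <- c(20, 40, 10, 50)    # true identity of each record in dataset 2

# Candidate pairs after blocking: row indices into dataset 1 and dataset 2
pairs <- data.frame(id1 = c(1, 1, 2, 3),
                    id2 = c(3, 2, 1, 4))

# Assumed rule: a pair is a true match iff the two identity values are equal,
# regardless of whether the compared fields (e.g. bm) agree
pairs$is_match <- as.integer(identity1[pairs$id1] == identity2[pairs$id2])

print(pairs)
```

If this reading is right, the same comparison on the real data, e.g.
identity.RLdata500[59] == identity.RLdata10000[1394], would explain why a
pair can be flagged 1 although bm disagrees, and why no pair keeps an NA
status once both identity vectors are supplied.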
What am I missing? How does one probabilistically link two datasets using
the compare.linkage function in the RecordLinkage package?
Anders Alexandersson
andersalex at gmail.com