[R] Merge data frame with mispelling characters
VictorDelgado
victor.maia at fjp.mg.gov.br
Fri Nov 2 20:50:02 CET 2012
David Winsemius wrote
> On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:
>
>> Hello dear R-helpers,
>>
>> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of
>> two
>> data frames by characters.
>> In each data frame I got two different list of names, that is my main-key
>> to
>> be merged.
>>
>> To figure out what I'm saying, I build up a modified "?merge" example,
>> with
>> errors by purpose:
>>
>> # Data for authors:
>>
>> authors <- data.frame(
>> surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")),
>> nationality = c("US", "Australia", "US", "UK", "Australia"),
>> deceased = c("yes", rep("no", 4)))
>>
>> "Venables" is without the final 's', and "Tierney, without "i".
>>
>> # Data for books:
>>
>> books <- data.frame(
>> surname = I(c("Tukey", "Venables", "Tierney",
>> "Ripley", "Rippley", "McNeil", "R Core")),
>> title = c("Exploratory Data Analysis",
>> "Modern Applied Statistics ...",
>> "LISP-STAT",
>> "Spatial Statistics", "Stochastic Simulation",
>> "Interactive Data Analysis",
>> "An Introduction to R"),
>> other.author = c(NA, "Ripley", NA, NA, NA, NA,
>> "Venables & Smith"))
>
> In your example the authors list has better spelling. The 'agrep'
> functions by default will return matches that are 90% ( or more precisely
> Levenshtein distance of less than or equalt to 0.1) :
>
>
> books$altname <- NA
> altidx <- unlist( sapply(books$surname, agrep, authors$surname) )
> books$altname[seq(altidx)] <- authors$surname[altidx]
> books
> #-----------------------
> surname title other.author altname
> 1 Tukey Exploratory Data Analysis
> <NA>
> Tukey
> 2 Venables Modern Applied Statistics ... Ripley Venable
> 3 Tierney LISP-STAT
> <NA>
> Terney
> 4 Ripley Spatial Statistics
> <NA>
> Ripley
> 5 Rippley Stochastic Simulation
> <NA>
> Ripley
> 6 McNeil Interactive Data Analysis
> <NA>
> McNeil
> 7 R Core An Introduction to R Venables & Smith
> <NA>
> If you then match 'books' to 'authors' with a merge on authors$surname and
> books$altname, you should get closer to your goals
>
> --
> David.
>>
>> With "surname" column instead of "name" (differs from original example
>> for
>> more easy going merge). And the second "Ripley" with double "p".
>>
>> So, if I ask for:
>>
>> merge(authors, books, all=TRUE)
>>
>> I got:
>>
>>
>> But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney"
>> and "Venable" to "Venables". I was wondering if there was any way to work
>> around this problem. My orginal data have around 27,000 name entries, and
>> if
>> I take "all=FALSE", this database drops out to around 17,000, most
>> because
>> mispelling (or truncated expressions). If I take "all=TRUE", I got many
>> of
>> this
> <NA>
> cases like the example above.
>>
>> Has anyone experienced this? Any idea how I can get out? I'm thinking to
>> take the longest match possible to each entry. For example, in
>> "Venable"/"Venables" there is a 87.5% match. As I have name and surname,
>> and
>> also auxiliary keys to this match, I think this could work.
>>
>> Thank you in advance.
>>
>>
>>
>> -----
>> Victor Delgado
>> cedeplar.ufmg.br P.H.D. student
>> www.fjp.mg.gov.br reseacher
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>>
> R-help@
> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Alameda, CA, USA
>
> ______________________________________________
> R-help@
> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
It's amazing to get such useful answers so fast. I did not know the
RecordLinkage package, it looks very sophisticated and useful for this kind
of demand. I just made some tests and I think it could be very useful.
I'm working with portuguese spelling names, so I will also test agrep and
see what function returns better results, giving less data loss.
Thank you a lot, Jim Holtman and also David Winsemius.
-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
UFOP assistant professor
--
View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255p4648266.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list