A few months ago, I need to synchronize our groupware's address book with our employee database. Since the address book provded only minimal data about employees, and the employee database didn't contain any of the exact fields in the address book, I had to synchronize them with nothing more than first and last names.
This is not as easy as it sounds. It quickly became apparent that only a small portion of the >3500 records would match directly. A simple SQL join between imported tables would not work; the data was too inconsistent. After matching a few names manually, I became increasingly obsessed with the problem of matching names: identifying that "Bill Gates" was the same person as "William Gates III".
The problem was tenacious: I would add processing to catch misspellings or hyphenations, and new issues would come up. The script quickly grew into a module, Lingua::EN::MatchNames, and then a second, Lingua::EN::Nickname, in response to the bizarre and arbitrary system of shortening of first names. Many first name forms have little or no similarity whatsoever: Peggy = Margaret = Midge; and several can follow an almost endless mutation path: Peggy Margaret Martha Mary Maryanne Anna Roseanne Rosalyn Linda Melinda ...).
When the initial versions were complete, my script was able to match the vast majority of the records on its own (with greater than 85% certainty per match), and most of the rest either had no real world match, or were suggested by the highest confidence ranking of the script. While this is by no means a statistically significant sampling, it is extremely encouraging.
Once the databases had been matched, I had convenient test data for the modules, since they could now check their work. This allowed me to emphasize successful test methods, and de-emphasize ones that provided too many false positives.
This article will show you how to use these modules, and explains a little about how they work.
Lingua::EN::MatchNames has several dependencies: Lingua::EN::NameParse, Lingua::EN::Nickname, String::Approx, Text::Metaphone, and Text::Soundex. Kim Ryan's Lingua::EN::NameParse, in turn, requires Parse::RecDescent. Lingua::EN::Nickname has no dependencies.
Although they aren't bundled, your CPAN module should be able to follow the dependencies:
% perl -MCPAN -e "install('Lingua::EN::MatchNames')"
Windows users without a compiler will likely need to first use PPM to retrieve binaries of two of the requisites:
ppm install String-Approx Text-Metaphone
The main module, Lingua::EN::MatchNames, exports one function by default: name_eq(). You can either feed it four parameters:
name_eq( $firstname0, $lastname0, $firstname1, $lastname1 )
or two (thanks to Lingua::EN::NameParse, which breaks full names into their constituent components):
name_eq( $name0, $name1 )
and it will return a certainty score between 0 and 100, or undef if the names cannot be matched via any method known to the module.
If you ask for them, Lingua::EN::MatchNames will also export fname_eq() or lname_eq(), for matching first and last names. Both take two parameters, and each returns a certainty score between 0 and 100.
Lingua::EN::Nickname exports nickname_eq() for matching first names solely on the basis of nicknames, and nickroot(), which attempts to look up the full (formal) first name(s), given a nickname. If asked, it will also export nickmatch(), which returns a regex for matching all known full names given a nickname; or nickfollow(), which recursively searches for a path of related names joining the two names passed to it, and returns the number of ‘hops' between the two. In practice, unless you have a specific need, you will probably use Lingua::EN::Nickname only indirectly, through Lingua::EN::MatchNames.
These modules were designed for completeness over speed or size. You can expect that matching thousands of records will take hours when testing every possible match.
Typically, your script will build an array of [ uniqueid, firstname, lastname ] elements from the first database, then iterate over the records in the second database, collecting matches as shown in Listing 1.
Listing 1 assumes that your names are in a file with some ID number, a first name, and then a last name, all separated with tabs. Obviously, since the functions exported by a module accept simple strings, you can extract your names from anything you like, or even the command line.
name_eq() simply requires that some certainty exist for both first and last names, and returns a score that combines the two, weighting the last name more heavily (70%).
fname_eq() matches first names like so:
1. Simple equality is tested.
Trivial matches return a certainty of 100.
2. Informal names, delimited by parentheses or quotes, are recursed.
Portions of the name wrapped in parentheses or quotes are recursively checked against the potentially matching name with fname_eq(). "William (Chip)" would therefore match "Chip", with full certainty of the best submatch.
3. Extraneous initials are removed.
"H. Ross" matches "Ross" with a high certainty level; "Ross H." would also match at this stage. This step must be performed before the next step to avoid trivially poor chunk matches.
4. Name chunks, separated by symbols or mixed case, are recursed.
Names are broken into pieces at nonword characters, or capitalization changes. These parts are each recursively checked against the potentially matching name with fname_eq(). "Mary" would therefore match "MaryAnn", "Mary-Ann", or "Mary Ann", again with full certainty of the best submatch.
5. Inconsistent case is flattened to all upper case, and spaces and symbols are removed.
Legacy databases, particularly mainframes, tend to favor ALL UPPER CASE DATA. At this stage, "Arthur" matches "ARTHUR".
6. Nicknames are followed, matched, and ranked based on proximity, using Lingua::EN::Nickname.
This is an alarmingly difficult problem that eventually grew into its own module.
Lingua::EN::Nickname is a much more terrifying animal than Lingua::EN::Match-Names, due to the capricious nature of nicknames. Basically, it builds four giant hashes: one for looking up nicknames that have a single known root name (that is, full/formal form), one for looking up ambiguous nicknames, one for finding a regular expression matching all known forms of a root name, and one for mapping related root names. The functions in this module are fairly straightforward users of these hashes.
This vast hash data should not be edited by hand. Although I would prefer to receive mail regarding omissions, a utility for generating the Perl code for these hashes (nickhash.pl) is provided, and is a good starting point for other languages (Lingua::XX::Nickname). The original datafile used by my utility, nicknames.txt, is also included; each line consists of three tab-separated fields: root name, nicknames (a space/nonword-separated list), and related root names.
Any certainty returned by nickname_eq() is scaled back a little, and returned.
Continuing where we left off, the remaining four steps for matching first names are:
7. Misspellings are detected, within a threshold, using String::Approx.
Data-entry errors are tolerated by determining how many character insertions, transpositions, and deletions would be required to change one of the names into the other, and ensuring that this number is below the default threshold for String::Approx (10% of the characters, rounded up) for matches. "Bart" and "Bort" would now match one another, and "Brian" and "Brain" would now match.
8. Phonetic matching catches similar pronounciation, using Text::Metaphone.
Homonymous names are now matched using the surprisingly accurate Text::Metaphone module, which allows us to find "Sandy" = "SanDee", and "Cindy" = "Sindy", but with relatively low certainty.
9. Somewhat similar last names are caught with Text::Soundex.
Out of completeness, I've included soundex matching, which is very tenuous (especially for first names). Soundex matches return a very low certainty.
10. Regular expression check for simple truncation and relevant initials.
Initials are re-checked to see if they may not have been pure noise ("H. Ross" = "Herman"), and simple truncation is checked ("Bri" = "Brian"). Because of the tiny amount of matching data, this is the least certain match.
Matching last names with lname_eq() is a slightly different process:
1. Simple equality is tested.
2. Extraneous suffixes are removed.
Persons usually include unused names as initials formally; frequently, this means initials are simply noise. "Smith, Jr." matches "Smith" with a high certainty level; "Smith II" would also match at this stage.
3. Hyphenated surnames are recursed.
Names are broken into pieces at hyphens; "Bouvier-Simpson" would be recursively checked to match the other name against "Bouvier" and "Simpson", and if any matches are found, the score of the best submatch is returned.
4. Inconsistent case is flattened to all upper case, and spaces and symbols are removed.
Legacy databases, particularly mainframes, tend to favor ALL UPPER CASE DATA. Also, handling of names that contain apostrophes or spaces is terribly inconsistent. At this stage, "O'Neil" matches "O NEIL" and "ONEIL".
5. Misspellings are detected, within a threshold, using String::Approx.
"Hanson" and "Hansen" would now match, as well as "Simpson" and "Smipson".
6. Phonetic matching catches similarly pronounced last names, using Text::Metaphone.
7. Somewhat similar last names are matched with Text::Soundex.
8. Regular expression check for nonstandard hyphenation or simple truncation.
Nonstandard hyphenation (like lower-to-upper case changes) are caught and recursed.
The certainty scores returned by fname_eq() and lname_eq() are generated from weights I assigned after ranking each of these steps, based on frequency of use, reliability, some final tweaking after checking against my test data, and on my fondness for round numbers. These scores seem to be pretty reliable relative to each other, but ultimately have an inadequate scientific basis. Anyone willing to do research that would result in more accurate scores for these weights is encouraged to do so, and will receive credit in future versions of this module, plus my thanks.
So there you have it: how to match names with Lingua::EN::MatchNames and Lingua::EN::Nickname, and a peek under the hood to see how they work. If you are looking for a project, and have access to a vast database of names, the most critical work is precisely refining the weight of each step, and filling in any gaps in the list of nicknames._ _END_ _