They say "Learn from your mistakes"
Antoine Bello · January 9, 2023
In the past three months, we’ve imported over 350 million records into the database of the Population Project. In the next few days, we will wipe out all of them and start from scratch.
Why? Because we have too many duplicates, which is a sure sign that we have gaps in our filtering algorithm.
Some duplicates are inevitable. The “Jane Simmons born on 2/7/1983” got married and became “Jane Milbanks born on 2/7/1983”. The matching date of birth is not enough to infer that the two Janes are the same person. Not surprisingly, women account for 53% of our US records, when the real percentage is closer to 51%.
Other common sources of duplicates include typos and date-format mix-ups (Paul Summers, born 5/7/1997, and Paul Summers, born 7/5/1997, are only separated by a date-keeping convention), name changes, and truncated columns (shrinking Paul Summers’ name to “Paul Summ”). We’ve always been aware of these pitfalls, and while we do everything we can to address them, we know they will probably account for 1 to 2% of all records until we develop more sophisticated tools to spot them.
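To make two of these pitfalls concrete, here is a minimal Python sketch of how a swapped day/month and a truncated name could be flagged as possible matches. It is an illustration only, not the project's actual code: the function names and the `min_prefix` threshold are assumptions made for this example.

```python
from datetime import date


def dates_may_match(a: date, b: date) -> bool:
    """Return True if the dates are equal or differ only by a day/month swap."""
    if a == b:
        return True
    # 5/7/1997 read as May 7 and 7/5/1997 read as July 5 use the same digits
    # under two different date conventions.
    return a.year == b.year and a.month == b.day and a.day == b.month


def names_may_match(a: str, b: str, min_prefix: int = 6) -> bool:
    """Return True if the names are equal or one looks like a truncation of the other.

    min_prefix is a hypothetical threshold that keeps very short prefixes
    (for example "Paul") from matching everything that starts with them.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) >= min_prefix and longer.startswith(shorter)
```

A "may match" result would only mark two records as candidates for closer review. Two different people can share a name and have birth dates that happen to mirror each other, so neither check is proof on its own.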
The problem right now is that we have way more duplicates than the expected 3 to 5%. We’re already tracking at 102% of the US population, even though we have very few records of people under 18. This clearly indicates that our “disambiguation algorithm,” as we pompously call it, is too porous. As a reminder, we try to match every new record we import with one already existing in the database. Ideally, the new record will corroborate the old one or even enrich it with another datapoint.
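The match-or-create step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the real pipeline: records are plain dictionaries, the database is an in-memory list scanned linearly, and `import_record` and `is_compatible` are hypothetical names.

```python
from typing import Callable


def import_record(
    db: list[dict],
    new: dict,
    is_compatible: Callable[[dict, dict], bool],
) -> dict:
    """Merge `new` into the first compatible record, or append it as a new one."""
    for existing in db:
        if is_compatible(existing, new):
            # Enrich the existing record: only fill in fields it is missing,
            # never overwrite a value that is already there.
            for key, value in new.items():
                if existing.get(key) is None:
                    existing[key] = value
            return existing
    # No compatible record found: this is treated as a new person.
    db.append(dict(new))
    return db[-1]
```

In this sketch, every false negative from `is_compatible` creates a duplicate, which is why a compatibility test that is too strict inflates the record count.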
While reviewing our code one more time, we noticed a few gaps. Depending on which record came first, “John K Smith” and “John Kendrick Smith” were not always deemed compatible. To take another example, two “Mary Louise Miller” records were not recognized as the same person, because the first one was labeled as “Female” (as per the list from which we extracted her information), while the other was only presumed to be “Female” (based on her first name).
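Both gaps come down to comparisons that should not depend on record order or on how a value was obtained. Here is a minimal Python sketch of what order-independent checks could look like. The function names and the `(value, source)` representation of gender are assumptions made for this example, not the project's actual data model.

```python
def tokens_compatible(x: str, y: str) -> bool:
    """An initial is compatible with any name starting with that letter."""
    x, y = x.lower().rstrip("."), y.lower().rstrip(".")
    if x == y:
        return True
    if len(x) == 1:
        return y.startswith(x)
    if len(y) == 1:
        return x.startswith(y)
    return False


def names_compatible(a: str, b: str) -> bool:
    """Compare names token by token; the result is the same in either order.

    Simplification: names with a different number of parts
    ("John Smith" vs "John K Smith") are treated as incompatible here.
    """
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    return all(tokens_compatible(x, y) for x, y in zip(tokens_a, tokens_b))


def genders_compatible(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """Each gender is (value, source), with source "declared" or "presumed".

    Only the values are compared, so a declared "Female" and a presumed
    "Female" are compatible.
    """
    return a[0] == b[0]
```

For instance, `names_compatible("John K Smith", "John Kendrick Smith")` and `names_compatible("John Kendrick Smith", "John K Smith")` both return `True`, whichever record was imported first.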
Once we’re done plugging all these holes, we will wipe out our data and start afresh, which should take less time than hunting for duplicates among the existing records.
This is clearly disappointing. We’re working very hard to give the world its first population register, but we’re only human. We’ll make more mistakes along the way. Rest assured, though, that we’re learning from each of them and that our enthusiasm has never been stronger.