They say "Learn from your mistakes"

Antoine Bello · January 9, 2023

In the past three months, we’ve imported over 350 million records into the database of the Population Project. In the next few days, we will wipe out all of them and start from scratch.

Why? Because we have too many duplicates, which is a sure sign that we have gaps in our filtering algorithm.

Some duplicates are inevitable. The “Jane Simmons born on 2/7/1983” got married and became “Jane Milbanks born on 2/7/1983”. The matching date of birth is not enough to infer that the two Janes are the same person. Not surprisingly, women account for 53% of our US records, when the real percentage is closer to 51%.

Other common sources of duplicates include typos and date formats (Paul Summers, born 5/7/1997, and Paul Summers, born 7/5/1997, are only separated by a date-keeping convention), name changes, and truncated columns (shrinking Paul Summers’ name to “Paul Summ”). We’ve always been aware of these pitfalls, and while we do everything we can to address them, we know they will probably account for 3 to 5% of all records until we develop more sophisticated tools to spot them.
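Two of these pitfalls can be checked mechanically. The following Python sketch is only an illustration, not the project's actual code: the function names and the `min_len` threshold are assumptions made for this example. It flags two birth dates that differ only by a day/month swap, and a name that looks like a truncated prefix of another.

```python
from datetime import date
from typing import Optional


def swapped_day_month(d: date) -> Optional[date]:
    """Return d with day and month exchanged, or None if the swap is
    impossible (day > 12) or pointless (day equals month)."""
    if d.day > 12 or d.day == d.month:
        return None
    # Safe: the new month is <= 12 and the new day is <= 12.
    return date(d.year, d.day, d.month)


def maybe_same_birth_date(a: date, b: date) -> bool:
    """True if the dates are equal or differ only by a day/month swap."""
    return a == b or swapped_day_month(a) == b


def maybe_truncated(short: str, full: str, min_len: int = 8) -> bool:
    """True if `short` looks like `full` cut off by a fixed-width column.
    min_len is an arbitrary guard against matching very short prefixes."""
    short, full = short.strip().lower(), full.strip().lower()
    return min_len <= len(short) < len(full) and full.startswith(short)


print(maybe_same_birth_date(date(1997, 5, 7), date(1997, 7, 5)))
print(maybe_truncated("Paul Summ", "Paul Summers"))
```

A positive result here is only a hint that two records deserve a closer look; two different people can easily share a name and have swapped birth dates.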

The problem right now is that we have far more duplicates than the expected 3 to 5%. We’re already tracking at 102% of the US population, even though we have very few records of people under 18. This clearly indicates that our “disambiguation algorithm,” as we pompously call it, is too porous. As a reminder, we try to match every new record we import with one already existing in the database. Ideally, the new record will corroborate the old one or even enrich it with another datapoint.
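The match-then-corroborate-or-enrich step described above can be sketched in a few lines of Python. This is a deliberately naive illustration under stated assumptions, not the project's implementation: the record fields (`name`, `born`, `place`), the exact-match key, and the function name `import_record` are all hypothetical, and conflicting values are simply left untouched.

```python
def import_record(db: dict, record: dict) -> str:
    """Match a new record against the database; add, enrich, or corroborate.

    Assumed key: normalized full name plus date of birth. A real matcher
    would need fuzzier rules than an exact key lookup.
    """
    key = (record["name"].strip().lower(), record["born"])
    existing = db.get(key)
    if existing is None:
        db[key] = dict(record)  # no match: store a copy as a new person
        return "added"
    enriched = False
    for field, value in record.items():
        # Fill in datapoints the existing record lacks; never overwrite.
        if value is not None and existing.get(field) is None:
            existing[field] = value
            enriched = True
    return "enriched" if enriched else "corroborated"


db = {}
first = {"name": "Jane Simmons", "born": "1983-02-07", "place": None}
second = {"name": "jane simmons", "born": "1983-02-07", "place": "Ohio"}
print(import_record(db, first))
print(import_record(db, second))
print(import_record(db, second))
```

Every record that fails to match becomes a new person, so any gap in the matching rule turns directly into a duplicate.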

While reviewing our code one more time, we noticed a few gaps. Depending on which record came first, “John K Smith” and “John Kendrick Smith” were not always deemed compatible. To take another example, two “Mary Louise Miller” records were not recognized as the same person, because the first one was labeled as “Female” (as per the list from which we extracted her information), while the other was only presumed to be “Female” (based on her first name).
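Both gaps come down to a compatibility test that should give the same answer whichever record arrives first, and that should compare values rather than how they were obtained. Here is a minimal Python sketch of such a test. It assumes names split into the same number of tokens and sex stored as a hypothetical `(value, provenance)` pair; none of these names or conventions come from the project's code.

```python
def tokens_compatible(a: str, b: str) -> bool:
    """Two name tokens match if equal, or if one is the other's initial."""
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    if a == b:
        return True
    # Sort by length so the check does not depend on argument order.
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def names_compatible(name_a: str, name_b: str) -> bool:
    """Compare names token by token; order of arguments does not matter."""
    parts_a, parts_b = name_a.split(), name_b.split()
    if len(parts_a) != len(parts_b):
        return False
    return all(tokens_compatible(x, y) for x, y in zip(parts_a, parts_b))


def sexes_compatible(sex_a: tuple, sex_b: tuple) -> bool:
    """Each argument is (value, provenance), e.g. ("F", "stated") or
    ("F", "presumed"). Only the value is compared; provenance is ignored."""
    value_a, value_b = sex_a[0], sex_b[0]
    return value_a is None or value_b is None or value_a == value_b


print(names_compatible("John K Smith", "John Kendrick Smith"))
print(names_compatible("John Kendrick Smith", "John K Smith"))
print(sexes_compatible(("F", "stated"), ("F", "presumed")))
```

Whether a presumed value should ever block a match against a stated one is a design decision this sketch does not settle; it treats any disagreement in value as incompatible.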

Once we’re done plugging all these holes, we will wipe out our data and start afresh, which should take less time than hunting for duplicates among the existing records.

This is clearly disappointing. We’re working very hard to give the world its first population register but we’re only human. We’ll make more mistakes along the way. Rest assured though that we’re learning from each of them and that our enthusiasm has never been stronger.
