
When we have too many records...

Having too many records can be a sign of flaws in our processes.


Antoine Bello · April 29, 2024

Following the import of large data files, we are now dangerously close to 100% coverage of the US population. Why is that bad news, you may ask? Well, we know that in reality we are still pretty far from that milestone, if only because we have few files pertaining to children and teenagers. This in itself should keep us somewhere below 80%. The question then is: what are those excess records?

We’ve made lots of progress recently in identifying duplicate profiles. We now know with greater certainty when to merge two similar records and when to let them live their own separate lives. We can still refine our algorithms, but the crux of the explanation probably lies elsewhere.

We suspect the primary reason to be maiden names. Women account for 55% of our US records, vs 51.1% in official statistics. A difference that wide probably stems from the fact that most women change names when they get married, which exposes them to being captured twice in our database: once under each name. Few US organizations ever mention maiden names, unlike those in Catholic countries.
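To get a feel for what that gap implies, here is a back-of-the-envelope sketch. It rests on two strong assumptions that are ours for illustration only: every male record is counted exactly once, and the whole gap between 55% and 51.1% comes from duplicated female records. The function name `implied_female_surplus` is hypothetical.

```python
def implied_female_surplus(observed_share: float, true_share: float) -> float:
    """Fraction of female records that would be surplus.

    Assumes (for illustration) that male records are counted exactly once
    and that the gap is entirely due to duplicated female records.
    """
    # Female records per male record, as observed in the database.
    observed_ratio = observed_share / (1 - observed_share)
    # Women per man according to the official share.
    expected_ratio = true_share / (1 - true_share)
    return 1 - expected_ratio / observed_ratio


surplus = implied_female_surplus(0.55, 0.511)
print(f"Surplus among female records: {surplus:.1%}")
print(f"Surplus among all records:    {0.55 * surplus:.1%}")
```

Under those assumptions, roughly 14.5% of female records, or about 8% of all records, would be surplus. That is an order of magnitude, not a measurement.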

Other factors come into play, namely human errors, both ours and those committed by the official bodies providing our data. Names on PDF files are truncated (“Jacqueline” becomes “Jacqueli” or “Jacquel”), dates are transposed (“10/12/83” becomes “12/10/83”), and typos appear here and there, making the comparison of new records with existing ones delicate if not outright impossible.
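The two error types above lend themselves to simple tolerant comparisons. The following is a minimal sketch, not a description of our actual matching algorithms: the function names and the `min_prefix` threshold are hypothetical, and a match here only means "worth a closer look", not "merge".

```python
from datetime import date


def names_may_match(a: str, b: str, min_prefix: int = 5) -> bool:
    """True if one name is a truncation of the other (or they are equal)."""
    a, b = a.strip().lower(), b.strip().lower()
    shorter, longer = sorted((a, b), key=len)
    # Require a minimum length so that "Jo" does not match every "Jo..." name.
    return len(shorter) >= min_prefix and longer.startswith(shorter)


def dates_may_match(a: date, b: date) -> bool:
    """True if the dates are equal or differ only by a day/month swap."""
    if a == b:
        return True
    return a.year == b.year and a.month == b.day and a.day == b.month


print(names_may_match("Jacqueline", "Jacqueli"))                 # truncated name
print(dates_may_match(date(1983, 10, 12), date(1983, 12, 10)))   # swapped day/month
```

Both checks deliberately err on the side of flagging candidates. Two different people can share a truncated first name, so a rule like this would need to be combined with other fields before any merge.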

We make mistakes too. The first is failing to catch all the errors made before the data reached us. Sadly, we also sometimes corrupt perfectly good files. Labeling a column “Death year” instead of “Birth year” can result in the creation of half a million erroneous profiles. This has happened before, and we’re humble enough to know it will always remain a possibility.
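A mislabeled year column is the kind of mistake a cheap sanity check can sometimes catch before import. The sketch below is one possible heuristic, not our actual import pipeline: it assumes the source file mostly covers adults, so a “Birth year” column full of recent years is suspicious. The function name and all thresholds are hypothetical.

```python
def birth_year_column_is_plausible(
    years: list[int],
    reference_year: int,
    recent_window: int = 18,
    max_recent_share: float = 0.2,
    max_age: int = 120,
) -> bool:
    """Heuristic check that a column labeled 'Birth year' holds birth years.

    Assumes the file mostly covers adults: death years cluster in recent
    years, so a high share of recent values suggests a mislabeled column.
    """
    if not years:
        return False
    # Values in the future or implying an age above max_age are impossible.
    impossible = sum(
        1 for y in years if y > reference_year or reference_year - y > max_age
    )
    # Values implying an age below recent_window are unexpected in an adult file.
    recent = sum(1 for y in years if 0 <= reference_year - y < recent_window)
    return impossible == 0 and recent / len(years) <= max_recent_share


print(birth_year_column_is_plausible([1950, 1962, 1975, 1988, 1990], 2024))
print(birth_year_column_is_plausible([2019, 2020, 2021, 2022, 1990], 2024))
```

A check like this would only flag a file for human review. It would not work on files that legitimately cover children, which is exactly the population we lack.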

One thing gives us hope, though. AI and machine learning will make it easier to detect patterns within huge volumes of data. Acquiring fresh data will remain key, but tending to it will be just as important. Feel free to contact us if you have special talents in that department and would like to help.
