We've started reimporting records

Three months ago, realizing we had way too many duplicates in our database, we stopped importing lists to take a closer look at our algorithms. We now believe we have fixed the most glaring problems. After wiping out the entire database, we’ve started reimporting US records.

Antoine Bello · April 12, 2023

What have we learned?

First, we put a greater emphasis on dates. We used to accept long lists of names with no birth-related info, thinking they formed a great frame to later be fleshed out with details. The trouble is that most names are very common. How do you know if “Roger Brown born in Atlanta” and “Roger Philip Brown” are the same person? Add a matching date of birth to the equation and the probability that they are the same person is multiplied by 25,000. This has allowed us to merge lots of records. Most lists available online do not include DOB, but you’d be surprised how many we’ve been able to find.
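
To make the idea concrete, here is a minimal sketch of a merge check that only fires when both records carry the same date of birth. It is an illustration, not our actual matching code: the function names and the rule used for comparing names (same first and last name, middle names allowed to be missing on one side) are assumptions made for this example.

```python
from datetime import date
from typing import Optional


def names_compatible(name_a: str, name_b: str) -> bool:
    """Hypothetical rule: same first and last token, and the middle
    names of one record are a subset of the other's."""
    a, b = name_a.lower().split(), name_b.lower().split()
    if not a or not b:
        return False
    if a[0] != b[0] or a[-1] != b[-1]:
        return False
    mid_a, mid_b = set(a[1:-1]), set(b[1:-1])
    return mid_a <= mid_b or mid_b <= mid_a


def is_merge_candidate(name_a: str, dob_a: Optional[date],
                       name_b: str, dob_b: Optional[date]) -> bool:
    """Two records are merge candidates only if both have a date of
    birth, the dates are equal, and the names are compatible."""
    if dob_a is None or dob_b is None:
        return False  # without a date, a common name proves nothing
    return dob_a == dob_b and names_compatible(name_a, name_b)
```

With this rule, “Roger Brown” and “Roger Philip Brown” are merged when both records share a date of birth, and left apart when either date is missing.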

Second, we no longer infer the year of birth. When processing the results of the French baccalauréat, for instance, we used to infer that most candidates were between 17 and 21 years old, or, in our parlance: 19 +/- 2. We’ve dropped this practice because a mere 5% of candidates outside of these bounds translated into tens if not hundreds of thousands of errors in our base. The only exception we make is to deduce the birth year of a deceased person based on their age. If someone died in 2022 at the age of 72, we can say with absolute certainty that they were born in 1949 or 1950.
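
The arithmetic behind that exception fits in a few lines. This is a sketch with a hypothetical function name, assuming the age is given in completed years:

```python
def birth_year_range(death_year: int, age_at_death: int) -> tuple[int, int]:
    """Return the (earliest, latest) possible year of birth.

    If the person had already had their birthday in the year they died,
    they were born in death_year - age. If not, they were born one year
    earlier.
    """
    latest = death_year - age_at_death
    return (latest - 1, latest)


print(birth_year_range(2022, 72))  # (1949, 1950)
```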

Third, we stay away from unofficial lists. They simply present too many risks, from the use of nicknames (“Bill” for “William”) to truncated names (“Jose Michel Gonzal”) to their limited size (some are simply too short to make them worth our while). In a nutshell:

  • We love electoral and vital records, obituaries, exam results, lists of social aid recipients and certified professionals.

  • We like professional electoral colleges, lists of sports federation members and college graduates.

  • We are wary of marathon results, trade show and conference attendees, political donors and delinquent taxpayers.

Of course this is all relative. While we can afford to be picky in Western countries, we might have to lower our standards in Africa or Asia.

Fourth, we have improved our geo system. Each location, whether a city, region or country, has a unique ID whose design reflects the hierarchical geo chain. We now know just by looking at their IDs that a person born in Boston and a namesake born in Massachusetts can actually be the same person.
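
Here is one way such a check could work. The ID format is an assumption made up for this example (dot-separated segments running from country to city, such as "US.MA.BOSTON"); our real IDs may look nothing like this. Two birthplaces are compatible when one ID is an ancestor of the other in the chain:

```python
def places_compatible(id_a: str, id_b: str) -> bool:
    """True if one location contains the other, or they are identical.

    Assumes hypothetical IDs made of dot-separated segments ordered
    from the broadest level (country) to the narrowest (city).
    """
    a, b = id_a.split("."), id_b.split(".")
    shorter, longer = sorted((a, b), key=len)
    # Compare whole segments, so "US.MA" does not match "US.MAINE".
    return longer[:len(shorter)] == shorter


print(places_compatible("US.MA.BOSTON", "US.MA"))  # True
print(places_compatible("US.MA.BOSTON", "US.GA"))  # False
```

Comparing whole segments rather than raw string prefixes avoids false matches between places whose codes merely start with the same letters.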

Last, we’ve put in place stringent mechanisms to detect typos. In the past, when confronted with a “Frederack Smith”, we would have thought: “We don’t know this first name. Let’s add it to our list and import this person.” We now hold on to the record until we encounter more Frederacks. Past a certain number of occurrences, we create the records that were on hold. Otherwise, we delete them. This system has helped us identify scores of bad records, with first names such as “MichaelT” or “CarolM” that were in fact “Michael T” and “Carol M”.
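
The hold-and-release logic can be sketched as follows. The class name, the threshold of 3 and the way records are represented are all assumptions for illustration; the post does not say how many occurrences we actually require.

```python
from collections import defaultdict


class FirstNameGate:
    """Holds records with unknown first names until the name has been
    seen often enough to be trusted (hypothetical sketch)."""

    def __init__(self, known_names, threshold=3):
        self.known = {n.lower() for n in known_names}
        self.threshold = threshold          # assumed value, not from the post
        self.pending = defaultdict(list)    # first name -> records on hold

    def submit(self, first_name, record):
        """Return the list of records that can be created right now."""
        key = first_name.lower()
        if key in self.known:
            return [record]
        self.pending[key].append(record)
        if len(self.pending[key]) >= self.threshold:
            # Enough occurrences: trust the name and release the backlog.
            self.known.add(key)
            return self.pending.pop(key)
        return []

    def discard_pending(self):
        """Drop everything still on hold and return what was dropped."""
        dropped = [r for records in self.pending.values() for r in records]
        self.pending.clear()
        return dropped
```

A first name like “MichaelT” that never reaches the threshold stays in `pending` and is eventually dropped by `discard_pending`, which is also a convenient place to review suspicious names by hand.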

The first three countries we plan to import are the United States, the United Kingdom and France.

We will also announce important new features soon.

Thank you for your continued support. We once compared the Population Project to a marathon. It looks more and more like some insane ultra-running race!