Methodology
The Population Project collects what we deem to be the bare minimum to establish the uniqueness of a human being:
· full name (first, middle, last and maiden names);
· sex at birth;
· date of birth (year, month, day);
· place of birth (country, region, city);
· date of death (year, month, day);
· place of death (country, region, city).
We consider other potential features to be either irrelevant (profession, income, address or phone number…) or potentially discriminatory (race, religion…). We do not log parents’ identities, because they are often missing or wrong.
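As an illustration, the fields listed above could be held in a small data structure like the one below. This is only a sketch: the class and field names (`PersonRecord`, `PartialDate`, `Place`) are assumptions made for this example and do not describe the project's actual schema. Every field other than the first and last name is optional, since lists rarely provide everything.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartialDate:
    """A date whose year, month, or day may each be unknown."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None


@dataclass
class Place:
    """A place described at up to three levels."""
    country: Optional[str] = None
    region: Optional[str] = None
    city: Optional[str] = None


@dataclass
class PersonRecord:
    """Hypothetical container for the six features listed above."""
    first_name: str
    last_name: str
    middle_name: Optional[str] = None
    maiden_name: Optional[str] = None
    sex_at_birth: Optional[str] = None  # "F", "M", or None when unknown
    birth_date: PartialDate = field(default_factory=PartialDate)
    birth_place: Place = field(default_factory=Place)
    death_date: PartialDate = field(default_factory=PartialDate)
    death_place: Place = field(default_factory=Place)


# Example taken from later in this page: Patrick Bernard, born in Quimper.
record = PersonRecord(
    first_name="Patrick",
    last_name="Bernard",
    birth_date=PartialDate(year=1987, month=7, day=28),
    birth_place=Place(country="France", region="Bretagne", city="Quimper"),
)
```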
All the information we collect is readily available on the internet, without the use of any login or password. Important sources include:
· electoral records
· vital records (birth, marriage, divorce, death)
· exam results
· certified professionals such as nurses, accountants, actuaries, etc.
Once recorded, a name cannot be traced back to its source. Whether we found the name of John Doe in a list of inmates, triathletes, or college graduates, that information will not appear anywhere on the site.
Name Structures
Countries have different name structures. Most Americans have a middle name; Kazakhs don’t. Germans take the name of their father; Spaniards take that of their two parents. Such diversity requires a certain agility on our part. Spanish or Mexican lists contain four fields: “primer nombre, segundo nombre, apellido paterno, apellido materno”. A French list has only three: “prénom, autre(s) prénom(s), nom de famille”.
Determining females’ maiden names
Rare are the lists that provide both a female’s last name and her maiden name. When confronted with a new profile, we assume that a female’s last name is also her maiden name if and only if her age is less than a certain threshold specific to each country.
Say we find the record of Emilia Baumgarten, from Germany. If Emilia is 18, we will assume that “Baumgarten” is both her maiden and last name. If she is 23, we will record “Baumgarten” as her last name and leave her maiden name blank.
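The rule above can be sketched in a few lines. The threshold value used for Germany here (21) is purely hypothetical: the text only tells us that 18 falls below the German threshold and 23 falls above it. The function name and the country-code table are likewise assumptions for this example.

```python
from typing import Optional

# Hypothetical per-country age thresholds; the real values are not published here.
MAIDEN_NAME_AGE_THRESHOLD = {"DE": 21}


def infer_maiden_name(last_name: str, age: Optional[int], country: str) -> Optional[str]:
    """Return the maiden name to record for a female profile, or None to leave it blank."""
    threshold = MAIDEN_NAME_AGE_THRESHOLD.get(country)
    if threshold is None or age is None:
        # Without a threshold or an age we cannot apply the rule.
        return None
    return last_name if age < threshold else None


print(infer_maiden_name("Baumgarten", 18, "DE"))  # maiden name assumed
print(infer_maiden_name("Baumgarten", 23, "DE"))  # left blank
```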
Determining gender based on first name
A clarification first: when using the term gender, we mean the sex as it appears on someone’s birth certificate. It is neither in our power nor our wish to track people’s gender evolution throughout their lives. Using the sex at birth is more practical than the sex at age 20 or 50.
We deal with three types of lists:
· those that include the gender, which are the ones we prefer. 🙂
· those whose very nature dictates the gender of the humans on the list, for instance “2022 graduates of the all-girls school of the Holy Child” or “2021 London Marathon results, male over 45 category”.
· lists that don’t include the gender. In most cases, we are able to determine the sex of the human based on their first name. We use the API of our partner site Forebears, which can tell, for each combination of a first name and a country, how likely it is to be male or female. When the probability is over 95%, we heed Forebears’ advice. Otherwise, we leave the field blank.
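The 95% rule for the third type of list can be sketched as follows. The lookup is passed in as a plain function so that nothing here pretends to reproduce the Forebears API: `toy_lookup` and its probabilities are made up for illustration, and `infer_sex` is a hypothetical name.

```python
from typing import Callable, Optional

# A lookup takes (first_name, country) and returns the probability that the
# name is female, or None when the name is unknown for that country.
Lookup = Callable[[str, str], Optional[float]]


def infer_sex(first_name: str, country: str, lookup: Lookup,
              threshold: float = 0.95) -> Optional[str]:
    """Return "F" or "M" only when the probability exceeds the threshold."""
    p_female = lookup(first_name, country)
    if p_female is None:
        return None
    if p_female > threshold:
        return "F"
    if (1.0 - p_female) > threshold:
        return "M"
    return None  # too ambiguous: leave the field blank


# Made-up probabilities standing in for the real service.
_TOY_DATA = {("name_a", "DE"): 0.99, ("name_b", "DE"): 0.02, ("name_c", "DE"): 0.60}


def toy_lookup(first_name: str, country: str) -> Optional[float]:
    return _TOY_DATA.get((first_name.casefold(), country))
```

Leaving the field blank for ambiguous names trades coverage for accuracy, which matches the preference stated above for a missing value over a wrong one.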
Estimating birth years
Not all lists contain birth dates. However, it is sometimes possible to estimate the birth year with a relatively high level of certainty. To us, an age window is better than no age at all, as it can help us tell two records apart.
An American student graduating from high school in 2022 is more than 95% likely to be 17, 18 or 19 years old, which we translate into an assumed birth year of 2004, with a confidence margin of +/- 1 year.
Other countries might call for wider confidence margins. In Tanzania, for instance, students completing Standard Four (SFNA) are 11 years old, +/- 2 years.
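The two examples above amount to subtracting a typical age from the year of the event and keeping the margin alongside the result. A minimal sketch, in which the table keys, the class name, and the `may_be_same_person` helper are assumptions made for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirthYearEstimate:
    year: int    # central estimate
    margin: int  # +/- this many years

    @property
    def earliest(self) -> int:
        return self.year - self.margin

    @property
    def latest(self) -> int:
        return self.year + self.margin


# (typical age at the event, margin in years), taken from the two examples above.
EVENT_AGE = {
    ("US", "high_school_graduation"): (18, 1),
    ("TZ", "sfna"): (11, 2),
}


def estimate_birth_year(country: str, event: str, event_year: int) -> BirthYearEstimate:
    typical_age, margin = EVENT_AGE[(country, event)]
    return BirthYearEstimate(year=event_year - typical_age, margin=margin)


def may_be_same_person(a: BirthYearEstimate, b: BirthYearEstimate) -> bool:
    """Two estimates are compatible when their windows overlap."""
    return a.earliest <= b.latest and b.earliest <= a.latest


us = estimate_birth_year("US", "high_school_graduation", 2022)
print(us.earliest, us.latest)  # 2003 2005
```

The window check shows why an age window is useful even without an exact date: two records whose windows do not overlap cannot describe the same person.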
Normalizing birth places
By its very nature, the Population Project deals with all types of lists in many different languages. Yet we must unify the names of cities and regions. Otherwise, “Joachim Gourdon, born 1/5/2007 in Paris” and “Joachim Gourdon, born 1/5/2007 in Parigi” would be considered two different persons just because “Parigi” is the Italian translation of “Paris”.
Our job is further complicated by the fact that lists sometimes use different administrative levels. The same person can be said to have been born in “Formigine”, in the “province of Modena”, in “Emilia-Romagna” and in “Italy”, and all of these statements are true.
After much consideration, we’ve decided to use the GeoNames taxonomy, with four administrative levels: Italy `>` Emilia-Romagna `>` Province of Modena `>` Formigine. For each item, GeoNames provides a list of translations: “Paris, 75016, Parijs, Parij, Parizo, etc.”.
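To make the idea concrete, here is a small sketch of how name variants and administrative levels could be reconciled. The hand-written table is a toy stand-in for GeoNames data, not an extract from it, and the intermediate levels for Paris are deliberately omitted. The function names `resolve` and `compatible` are assumptions for this example.

```python
from typing import Optional, Tuple

Path = Tuple[str, ...]

# Toy table: lower-cased name variant -> canonical path from country downward.
PLACE_PATHS = {
    "italy": ("Italy",),
    "emilia-romagna": ("Italy", "Emilia-Romagna"),
    "province of modena": ("Italy", "Emilia-Romagna", "Province of Modena"),
    "formigine": ("Italy", "Emilia-Romagna", "Province of Modena", "Formigine"),
    # Intermediate levels left out for brevity in this toy table.
    "paris": ("France", "Paris"),
    "parigi": ("France", "Paris"),
    "parijs": ("France", "Paris"),
}


def resolve(name: str) -> Optional[Path]:
    """Map any known spelling of a place to its canonical path."""
    return PLACE_PATHS.get(name.strip().casefold())


def compatible(a: str, b: str) -> bool:
    """True when one place is the other, or contains the other."""
    pa, pb = resolve(a), resolve(b)
    if pa is None or pb is None:
        return False  # unknown places are never assumed to match
    shorter, longer = sorted((pa, pb), key=len)
    return longer[:len(shorter)] == shorter


print(resolve("Parigi") == resolve("Paris"))      # True
print(compatible("Formigine", "Emilia-Romagna"))  # True
```

With this representation, "born in Formigine" and "born in Emilia-Romagna" no longer look like a contradiction, while "born in Formigine" and "born in Paris" still do.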
The notion of country seems critical when you start listing humans, and yet it is difficult to pinpoint.
Citizenship might seem the route to go, but it presents several difficulties. About 15% of all humans are not officially registered and will therefore never have a passport. Borders shift and country names change, resulting in some people being born in countries that don’t exist anymore. And last but not least, citizenships can be acquired or given up over the course of one person’s life.
Using the country of residence would be equally flawed and would force us to store information that we don’t want to log.
After much consideration, we have decided to define a person’s country as an entity to which they have or have had substantial ties. If you were born in Ivory Coast, you might appear on a maternity registry there. If you then studied in France and took the Baccalauréat, you will appear on official school records. And let’s assume you spent the rest of your life in Argentina, we’ll probably find you on a list of chess players, gun holders, or certified accountants.
So, yes, it’s technically possible for one person to have multiple records, although we doubt it represents more than 1-2% of the total at the moment. We do not have a way to tie these parallel records together yet; that might come one day. We use the ISO classification of countries.
There are many ways to define what constitutes a country. We’re using the least controversial one, the United Nations member states. Other territories are rolled into the country they are a dependency of. As an example, the United States comprises:
· DC and the 50 states;
· American Samoa;
· Guam;
· Puerto Rico;
· the Northern Mariana Islands;
· the US Virgin Islands.
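The roll-up for the United States example can be expressed as a simple lookup. The two-letter codes are the ISO 3166-1 alpha-2 codes of the five territories listed above; the table covers only this one example, and the function name `country_of` is an assumption.

```python
# ISO 3166-1 alpha-2 code of a territory -> code of the country it is rolled into.
# Only the United States example from the text is covered here.
TERRITORY_ROLLUP = {
    "AS": "US",  # American Samoa
    "GU": "US",  # Guam
    "PR": "US",  # Puerto Rico
    "MP": "US",  # Northern Mariana Islands
    "VI": "US",  # US Virgin Islands
}


def country_of(iso_code: str) -> str:
    """Return the country a code is rolled into, or the code itself."""
    code = iso_code.strip().upper()
    return TERRITORY_ROLLUP.get(code, code)


print(country_of("PR"))  # US
print(country_of("FR"))  # FR
```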
The world population currently stands at around 8 billion. It is estimated that 120 billion people have roamed the earth since the dawn of man.
For the moment, the Population Project concerns itself with the 8 billion. Genealogists know how difficult it is to go back further than a few generations. We think listing all people alive will be challenging enough.
Our focus is currently on listing living persons. The Population Project started in early 2021, so we’re technically looking for all people who were alive on January 1, 2021. But if our goal is to look at the current state of the world’s population, we will need to log deaths too. It will probably prove even more difficult than finding the living. Live humans do things that land them on lists: they get married, they take exams, they run races, they run for mayor. Dead people are, in the public sphere at least, quickly forgotten. If you miss their obituary, provided they had one of course, you might never hear of them again.
The level of information we collect on each human is low enough to not be intrusive and high enough to give us a sense of each life’s singularity. In other words, there might be hundreds of Patrick Bernards, but only one Patrick Bernard born on 7/28/1987 in Quimper, Bretagne, France.
Is that really true? Let’s put the world’s most common last name to the test. There are about 100 million people with the name Zhang on Earth, most of them in China. Li accounts for about 2% of Chinese first names, so there are about 2 million Li Zhangs. Shanghai intra muros concentrates about 2% of China’s population, so there must be around 40,000 Li Zhangs born in Shanghai in the past 70 years. There are about 25,000 days in 70 years, meaning that, on average, about 1.6 Li Zhangs are born in Shanghai every day. And we haven’t even used the middle name as a differentiator…
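The back-of-the-envelope estimate above can be reproduced in a few lines. All inputs are the rounded figures quoted in the text; the only difference is that this sketch uses 70 × 365 = 25,550 days instead of the rounded 25,000, so the daily figure comes out near 1.57.

```python
# Rounded figures quoted in the text.
zhang_worldwide = 100_000_000  # people named Zhang
li_share = 0.02                # share of Chinese first names that are "Li"
shanghai_share = 0.02          # share of China's population in Shanghai
days_in_70_years = 70 * 365    # 25,550 (leap days ignored)

li_zhangs = zhang_worldwide * li_share              # about 2 million
li_zhangs_shanghai = li_zhangs * shanghai_share     # about 40,000
per_day = li_zhangs_shanghai / days_in_70_years

print(f"Li Zhangs: {li_zhangs:,.0f}")
print(f"Born in Shanghai over 70 years: {li_zhangs_shanghai:,.0f}")
print(f"Born in Shanghai per day: {per_day:.2f}")
```

Since the average is above one per day, a name, a city, and a full birth date are not always enough on their own to single out one person.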
We have tried to express this notion in numbers. For each human in our base, we calculate two indicators, ranging from 1 (rare) to 100 (very common).
· a basic index, which is based purely on the combination of first name and last name. “Linda Smith” scores 100 in the US and Canada, but only 15 in Italy (not a lot of Linda Smiths in Italy).
· a full index, which takes all the information we have into consideration. Linda T Smith, born in Santa Fe, New Mexico, USA on 10/4/1998 scores 1, which basically means it’s impossible to confuse her with someone else.
As we measure the progress of the Population Project by the number of records we have, we also keep a close eye on the average full index of those records. We have a lot of records in Mexico, but they are of poor quality, meaning their full index is still high. We have far fewer records in Italy, but they are excellent (full index close to 1).
Every time we process a new name, we ask ourselves the question: is it already in the base? There are three possible answers.
· We know for sure it is not in the base. “Patricia Donzelli” in Mozambique? No match. We create a new record. Easy.
· A similar record is already in the base and the new one doesn’t contain any additional information. Back to the example of “Patricia Donzelli”. Yes, we have one “Patricia Donatella Donzelli”. They might be two different persons, but how can we be sure? We ignore the new record.
· A similar record is already in the base and the new one contains additional information. Say the new record is “Patricia Donzelli, born in 1983”. We have one “Patricia Donatella Donzelli”. Because Patricia Donzelli is a rare name in Mozambique, it’s very likely that we’re talking about the same person. We merge the two records, resulting in one “Patricia Donatella Donzelli, born in 1983.”
Needless to say, this comparison process has been optimized to run quickly and without manual intervention, to the tune of several million records a day.
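The three outcomes described above (create, ignore, merge) can be sketched as a small decision function. This is an illustration under simplifying assumptions, not the project's actual matching code: records carry only four fields, the rarity of a name is reduced to a boolean `name_is_rare` flag, and the case of a common name with new information, which the text does not cover, is conservatively treated as "ignore".

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    CREATE = "create"
    IGNORE = "ignore"
    MERGE = "merge"


@dataclass(frozen=True)
class Record:
    first_name: str
    last_name: str
    middle_name: Optional[str] = None
    birth_year: Optional[int] = None


def _agree(a, b) -> bool:
    """Two optional values agree when either is missing or both are equal."""
    return a is None or b is None or a == b


def _first(a, b):
    """Prefer the existing value; fall back to the new one."""
    return a if a is not None else b


def similar(new: Record, old: Record) -> bool:
    return (new.first_name.casefold() == old.first_name.casefold()
            and new.last_name.casefold() == old.last_name.casefold()
            and _agree(new.middle_name, old.middle_name)
            and _agree(new.birth_year, old.birth_year))


def adds_information(new: Record, old: Record) -> bool:
    return ((new.middle_name is not None and old.middle_name is None)
            or (new.birth_year is not None and old.birth_year is None))


def decide(new: Record, base: List[Record],
           name_is_rare: bool) -> Tuple[Action, Optional[Record]]:
    matches = [old for old in base if similar(new, old)]
    if not matches:
        return Action.CREATE, new
    if len(matches) == 1 and name_is_rare and adds_information(new, matches[0]):
        old = matches[0]
        merged = replace(old,
                         middle_name=_first(old.middle_name, new.middle_name),
                         birth_year=_first(old.birth_year, new.birth_year))
        return Action.MERGE, merged
    return Action.IGNORE, None  # assumption: when in doubt, do nothing


base = [Record("Patricia", "Donzelli", middle_name="Donatella")]
print(decide(Record("Patricia", "Donzelli", birth_year=1983), base, name_is_rare=True))
```

Run on the Patricia Donzelli example, the call returns a merge that keeps the middle name from the existing record and the birth year from the new one.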
For the moment, the Population Project’s database only stores records in the Latin alphabet. Also called the Roman alphabet, this set of characters is used as the official script in over 130 countries. It is predominant in the Americas, Oceania and Western Europe.
In the coming months, though, we plan to introduce other writing systems, starting with Chinese, Japanese, Korean, and Cyrillic. Ideally the Japanese name “Sato” will be stored in two fields: “Sato” and “佐藤”. Of course, technical challenges abound, as the word “Sato” can be written with different sets of kanji and the kanji 佐藤 can be read in ways other than Sato.
We’ll be sure to explain our methodology for each new alphabet in great detail.
The Population Project is committed to only using lists that are public in nature, which excludes lists protected by a login or a password. The directory of a university’s alumni is off-limits, unless of course the university puts it online without restriction. Public in nature means that we don’t use lists that have been hacked or acquired through illegal means.
We usually don’t pay for lists, but we sometimes pay to accelerate the collection process. For instance, state and county electoral records are public in the US. Rather than contacting each of the 3,000 US counties, we pay an intermediary who has already gone through the process.
We do not collect records from social networks, as they are littered with fake profiles (by their own admission, Facebook destroys 400 million records every quarter). Even genuine profiles do not pass our strict criteria (frequent use of nicknames, unreliable age, etc.).
Whether we found the name of John Doe in a list of inmates, triathletes, or college graduates, that information will not appear anywhere on the site.
We comply with the GDPR, a European regulation passed in 2016 and considered to be the world’s most stringent on privacy rights.
According to GDPR, personal data may not be processed unless there is at least one legal basis to do so. Ours is that we perform a task in the public interest.
We will erase any individual’s personal data at their request within 30 days, in accordance with Article 17 of the GDPR.
The Population Project is a philanthropic endeavor, with no revenues at all.
We hereby commit to never selling your data.