Memo on the Population Project's database structure
A not-too-technical explanation on how the Population Project's database will be structured.
Antoine Bello · October 5, 2021
Volunteers or freelancers collect lists of humans around the world. Some of these lists are several millions long (i.e. registered voters in the state of New York), others only a few hundred. Some lists contain all the fields we’re looking for (i.e. first name, middle name, last name, maiden name, sex, country, date of birth, place of birth), others just contain bundled names (“John Smith” or “John T Smith”).
We clean the lists (converting files from PDF to Excel, getting rid of the unnecessary columns, removing duplicates). Files are then ready to be imported. Imports are called jobs. Depending on their origin, lists are labeled as official, semi-official or informal. The level of certainty of each field is set accordingly. All lists are assigned a nationality. For instance, “List of graduates from Harvard Business School” is labeled “USA”. “List of public employees from the state of Yucatan, Mexico”, is labeled “MEX”. A few lists, such as “Olympic medal winners” are labeled “INT” (for “International”). In this case, each human on the list has their nationality clearly specified. Under no circumstances can a human be imported without a country.
Logic of the structure
We will operate on the HMVC (Hierarchical Model View Controller) model. Basic tasks will be broken down between several bases linked together.
Bases will include, among others:
- a background base encompassing all low-variability data such as countries, first names, last names, language-related info, etc.
- an ingestion base, which will work as a chamber between excel files and the main repository.
- a main repository, which will contain all human records organized by countries for small countries, and by countries/alphabetical tables for larger countries. If or when the main repository becomes too large it can always be broken down into several bases after a country/continent logic.
- a statistics base, which will periodically query the main repository to compute and store statistics.
Logic of import jobs
a. Selecting the import file
We will put controls in place in order to limit the ingestion of junk.
- make sure there is not text in the numeric columns and vice-versa;
- make sure by sampling that the first and last name columns haven’t been mixed up;
- raise an alert when too many first or last names from a list can’t be found on Forebears, etc.
- other tests to ensure that we never import a defective file.
b. Logging the data (ingestion base)
Each human on the list is processed separately and goes through several steps:
Names are cleaned (to get rid of spaces, periods, etc.).
As names are often bundled, we separate them using a proprietary algorithm. For instance, “Bundled name : De la Tour Pierre Jacques Henri” becomes Last name: De la Tour; First name : Pierre; Middle name : Jacques Henri.
If the nationality of the human is unknown, we give it the overall nationality of the list. Our concept of nationality is intentionally loose to account for people who have lived in several countries. A same individual can be listed under two or three different countries. We might want to remedy that at some point, but for the time being, we think it’s not a major issue.
If we ignore the sex of a human, we try to guess it based on their first name and country. We only assign a gender if it’s more than 95% certain in Forebear.io’s API.
We register the year, month, and day of birth if we have them (three separate fields). The year of birth can sometimes be estimated from the nature of the list. For instance, a French student taking the Baccalauréat has a 95% likeliness to be between 17 and 21. In that case, we log the year corresponding to an age of 19, with a confidence margin of 2 years (in a separate field).
We register the city, region, country of birth when we have them (rare).
c. Computing additional information (ingestion base)
- We log what we call the full name (last name + ”_” + first name), for instance “Smith_John” or “De la Tour_Pierre”.
- We transform each name (first, middle, last, maiden) into a soundex or metaphone, which will greatly expand the functionalities of the search. Johnson_Michael becomes J525_M240 (we’re using the most basic soundex as an example, we might resort to a more sophisticated one).
- We elaborate on the soundex to calculate a hash that takes into account the other information we have about Michael Johnson. We use a smart hash that prioritizes the most important/discriminating information by placing them earlier in the string. For instance, we are much more likely to know the human’s birth year but not the day than the other way round.
- We compute three individual indexes, comprised between 1 and 100 on a logarithmic scale.
o The basic index expresses how unique a human is, first based on their first and last name. Michael Johnson for instance has a 90+ basic index. To avoid querying the “background” base incessantly when importing, for instance, American names, we’ll store US first and last names in an array in the ingestion base for the duration of the import. o The full index expresses how unique a human is, based on all the information we possess. Because we know a lot about him, Michael T Johnson, born on April 24, 1973, in Atlanta, GA has a full index of 1 (1 symbolically expressing the unicity of each human being). o The completion index expresses how much information we have on an individual. Each field adds a certain number of points, up to 100 (when we know all there’s to know).
d. Comparing the human’s soundex with the soundex of humans already in the base (ingestion base)
The ingestion base will contain one table per country with all the distinct soundex pairs. When dealing with human “Michael Johnson”, we’ll look in the Soundex_pairs_USA table whether the corresponding soundex, “J525_M240”, is new or not.
If it is, we add “J525_M240” to the base Soundex_pairs_USA and we add the property “new” to our object “human”. If if isn’t, we give the object human the property “test”. Then the object is placed into a temporary collection (or into two temporary collections: one for all .new items and one for the .test items).
e. Creating new humans or enriching existing ones (main repository)
Every 1,000 records, we send the collection(s) to the main repository.
For lack of testing, we’ll assume for the moment that Humans tables will host up to 25 million records (roughly the population of Australia). Countries smaller than Australia will fit in one table. The US will require approximately 14 tables, China and India 50 or 60 tables each. It might make sense to regroup small countries according to a geographical logic (Caribbean countries, Baltic countries, Pacific islands, etc.).
In countries over 25 million people, humans will be dispatched to a specific table based on the initial of their first name. In the US, for instance, the tables will go Humans_USA_BED, Humans_USA_CAF… all the way until Humans_USA_ZZZ. Australia’s only table will be Humans_AUS_ZZZ.
Humans with the property “new” are immediately created in the table corresponding to their name, while humans with the “test” property need to be compared to existing records. Let’s say there are 7,830 humans with the soundex pair J525_M240. We look within this selection how many humans share the same hash as our Michael Johnson. If there are none, we create our record. If there are one or more, we have no other choice than to compare the records one by one.
In so doing, we go by a basic principle: we only create a human if he/she can be differentiated from all other humans in the base on at least one feature. We won’t import another John Smith born in 1983 if we already have a John Randolph Smith born in 1983 (because they could be the same person). On the contrary, we will import a John T Smith born in 1983 (as John T Smith and John Randolph Smith are clearly different individuals).
When in doubt, we err on the side of caution. The only instances where we “merge” two humans is when their basic index makes us comfortable enough. A Michael Johnson born on 3/25/70 and another Michael Johnson born on 3/25/71 are very likely to be two different persons since more than one Michael Johnson is born every day in the USA. On the other hand, if we have two Nanshapur da Oliveira in Italy, one born on 3/25/70 and the other born on an unknown date in 1970, they’re very likely the same person. The various thresholds used in the algorithm need to be finetuned.
Last rule: corroborating a bit of information increases the level of certainty of the corresponding field, using a simple set of rules that can probably be improved:
- If the old information (can be a maiden name, a year of birth, etc.) has a lower certainty that the new information, we take the certainty of the new information.
- If the old information is more certain than the new information, we keep the certainty of the old information.
- If old and new information are equally certain, we increase the certainty according to the following formula: Increased Certainty = 1- ((1 – Old Certainty)*0.85)). For instance, two 80% sources result in an 83% certainty.
Structure of the bases
a. Background base
The background base includes the following tables:
- One country table, listing for each country its ISO-3 code, population, etc.
- One first_name_sex table per country. Each table contains between 1 000 and 30 000 common first names of the country + the gender breakdown. Necessary to calculate the basic index. Syntax: First_name_sex_FRA.
- One last_name table per country. Each table contains between 1 000 and 20 000 common last names of the country. Necessary to calculate the basic index. Syntax: Last_name_FRA.
- Two tables to store the results of all the Forebears API queries we perform (so that we don’t run the same queries over and over).
b. Ingestion base
- One import table, which might not even be necessary as records could be pass-through and stored only temporarily.
- One job table to record imports and all metrics associated with them (number of humans processed, created, merged, etc.).
c. Main repository
Humans tables contain seven fields. Whether all of them need to be indexed remains to be decided.
- the human id;
- the job id (through which job was this human imported?);
- a name object. Attributes include: first, first_certainty, first_metaphone; middle, middle_certainty, middle_metaphone; last, last_certainty, last_metaphone; maiden, maiden_certainty, maiden_metaphone.
- a sex object. Attributes are sex (1 or 2) and certainty.
- a birth object. Attributes include : day, day_certainty; month, month_certainty; year, year_certainty, year_confidence_margin; city, city_certainty; region, region_certainty; country, country_certainty.
- a death object. Same attributes as birth.
- a human_index object. Attributes (all three integers between 1 and 100) are: basic index; full index; completion.
Birth and death years can be expressed as intervals (1970 with a confidence margin of 2 years means “between 1968 and 1972”. This is especially useful when the year of birth can be estimated (for instance, people finishing high school in the US are most likely between 17 and 21 years old).
The main repository will have to be broken down into several bases. Because a 4D base can only sort in one alphabet, we’ll need dedicated bases for China, Japan, Thailand, Korea, etc.
Some data will be used much less frequently than others (for instance all death-related info). We might want to store it in another base. As an example, Australia would have two “Humans” bases, one with the basic info (names, birth, sex, etc.) and another for the death info. In the beginning, the “Death” base would be almost empty and would fill up extremely slowly. Or the first base would be only for live people and at the time of their death, they would migrate to the “Dead people” base. The merits of these ideas need to be further assessed.
d. Statistics base
Workers will query the main repository at regular intervals to calculate various indicators.
- Level of completion of humans per country (as an example, we have few records in Cameroon but they’re almost complete, lots or records in Mexico but they’re almost empty).
- Most popular last names, per country and overall.
- Most popular first names per gender, per country and overall.
- Average age per country.
- Gender breakdown by country (women tend to be underrepresented in many countries’ lists).
Some of the most interesting stats will pertain to the Population Project itself.
- Percentage of humans ingested that were already present in the base (will decrease over time).
- Number of humans per country, as a number and as a percentage of the country’s population.
- Number of humans added in the last hour/day/week/month.
- Number of lists submitted by the volunteers.
- Titles of the most recently submitted lists, etc.
- Backlog of lists and humans to process, etc.
Results will be stored in “historic tables”, making it possible to calculate variations and display graphs.
Search functionality (website)
Because we’ve chosen to organize humans by country, users of the website will be strongly encouraged to select a country prior to running a query. One way could be to preselect the country corresponding to their IP address in the “Country” dropdown menu.
If a query doesn’t return any results for a specific country, we’ll offer to extend it to neighboring countries (e. g. South American countries for Argentina) or to the entire world. One could imagine the following message. “Your query for Luigi Amekporphor didn’t yield any results. Would you like to extend it to other European countries or to the rest of the world (will take longer)?”.
From our perspective, searching for one name in the entire world means querying 239 bases, a number that we think can be reduced to 100 or so by regrouping small countries.
The use of metaphones will allow us to display extensive results, returning for instances the various spellings of Jansen or Mohamed.
In the long run, we might give researchers the possibility to build and download (anonymized or not) datasets.
Challenges and constraints
- Performance. Should all fields be indexed? Can the dialogue between the import base and the repository be fast enough to accommodate the ingestion of millions of records a day?
- Preemptive programming so we can run multiple import jobs in parallel.
- Handling various alphabets, each requiring their own database for sorting purposes.
- Working with multiple developers. The HMVC structure of the project should help.
- Create workers and use artificial intelligence to sweep the data at regular intervals for potential mistakes, typos, duplicates, etc. Over time, the names of places will need to be harmonized (London, Londres, Londra, Londen, etc. should be “London” ).
- 24/7 availability of the site, irrespective of maintenance.
a. Phase I
- Build the import process. This will force us to: settle on a field structure, choose a soundex method, program the hash, etc.
- Start work on the website.
b. Phase II
- Trial imports to test key assumptions such as number of humans per table, ideal size of import batches, etc.
c. Phase III
- Build the stats base: build ad hoc workers to query the main repository; archive historic results; implement visualization tools.