Populating the United States of America
How we quickly acquired 70% of the US population.
Antoine Bello · June 6, 2021
Six months ago, when we started to think about what would become the Population Project, we decided to begin our investigations with the two countries we know best: France and the United States. Granted, these countries can hardly be viewed as representative of the world population. Already highly digital, they could be expected to post more information online than the average country. But that was precisely the point: if we couldn’t quickly account for a significant proportion of the French and American populations, then the notion of one day mapping India or the African continent was outright delusional.
The US quickly proved to be a public data El Dorado. For starters, voter lists are available to the public in every state. Sometimes you can download them from the state or county electoral commission’s site, sometimes you have to order a disk. Each record contains the first and last name, the middle initial, the sex and the date (or only the year) of birth. In a little more than a month, we were sitting on 200 million records!
It took us a bit longer to discover a second gold mine: lists of public employees along with their salaries and pensions. Some states go to great lengths to conceal their data, but with a bit of tenacity we were able to get hold of most of it. Needless to say, we have no interest in how much civil servants make in Minnesota. We only store a handful of data points for each human and salary isn’t one of them.
With something like 80% of the American adult population in hand, we shifted our efforts to smaller datasets. For instance, there are several million inmates and sex offenders in the US. Over a million students graduate from college every year, their names printed in universities’ commencement programs. This made us realize that even if we didn’t know the graduates’ dates of birth, we could still guess their ages to within a two- or three-year margin.
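To make the graduation-year guess concrete, here is a minimal sketch of how such an estimate could be computed. It is an illustration only, not the project's actual code: the typical graduation age of 22 is an assumption, the three-year margin is the wider of the two figures mentioned above, and the names estimate_birth_year and BirthYearEstimate are hypothetical.

```python
from dataclasses import dataclass

# Assumption (not from the article): a typical US bachelor's graduate is about 22.
TYPICAL_GRADUATION_AGE = 22
# The article mentions a two- or three-year margin; 3 is the wider choice.
MARGIN_YEARS = 3


@dataclass(frozen=True)
class BirthYearEstimate:
    earliest: int
    likeliest: int
    latest: int


def estimate_birth_year(
    graduation_year: int,
    typical_age: int = TYPICAL_GRADUATION_AGE,
    margin: int = MARGIN_YEARS,
) -> BirthYearEstimate:
    """Guess a birth-year range from the year printed on a commencement program."""
    likeliest = graduation_year - typical_age
    return BirthYearEstimate(likeliest - margin, likeliest, likeliest + margin)


# A graduate listed in a 2021 commencement program.
estimate = estimate_birth_year(2021)
print(estimate.earliest, estimate.likeliest, estimate.latest)
```

A range rather than a single year keeps the uncertainty explicit, so a later source with an exact date of birth can simply narrow it.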
We then branched into sports. Individual sports federations typically publish rankings for their discipline. Swimming, track and field, and skiing proved the most fruitful avenues, with chess, bridge and Rubik’s Cube (50,000 names!) not far behind.
It became obvious at this point that younger generations would be the most difficult ones to come by. Except for 12-year-old swimmers and the occasional chess prodigy, we are desperately lacking in the under-18 category.
Overall, we consider our American campaign an unmitigated success. It has left us with hundreds of millions of records, including lots of overlapping ones that enable us to corroborate data and to gradually enrich profiles.
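As an illustration of how overlapping records can corroborate and enrich a profile, here is a minimal sketch. It is not the project's actual pipeline: the Record fields mirror the voter-list fields described earlier, while the matching rule (same normalized name and same birth year), the sources counter, and the sample people are all hypothetical assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Record:
    first_name: str
    last_name: str
    birth_year: int
    middle_initial: Optional[str] = None
    sex: Optional[str] = None
    birth_date: Optional[str] = None  # ISO date, when the source provides one
    sources: int = 1  # how many datasets corroborate this profile


def match_key(record: Record) -> tuple:
    """Naive matching rule (an assumption): same normalized name and birth year."""
    return (
        record.first_name.strip().lower(),
        record.last_name.strip().lower(),
        record.birth_year,
    )


def merge(a: Record, b: Record) -> Record:
    """Keep a's values, fill its gaps from b, and count the extra source."""
    return replace(
        a,
        middle_initial=a.middle_initial or b.middle_initial,
        sex=a.sex or b.sex,
        birth_date=a.birth_date or b.birth_date,
        sources=a.sources + b.sources,
    )


def consolidate(records: list) -> list:
    """Fold records that share a match key into a single enriched profile."""
    profiles = {}
    for record in records:
        key = match_key(record)
        profiles[key] = merge(profiles[key], record) if key in profiles else record
    return list(profiles.values())


# Hypothetical sample data: two sources describe the same person.
records = [
    Record("Jane", "Doe", 1980, middle_initial="Q", sex="F"),  # voter list
    Record("JANE", "doe ", 1980, birth_date="1980-03-14"),     # another dataset
    Record("John", "Smith", 1975, sex="M"),
]
profiles = consolidate(records)
for profile in profiles:
    print(profile)
```

A key this simple would wrongly merge two different people who share a name and a birth year, so a real system would need stronger evidence, such as location or a full date of birth, before treating two records as one person.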
Next week: France!