The logic behind Roamy

The Population Project has embarked on the creation of a list-searching bot called Roamy.

Antoine Bello · November 7, 2023

Over the past three years, we have become pretty skilled at finding good lists on the internet (good lists are long and contain dates or at least years of birth). They come in various formats, with PDF and XLS the most common. We have about 60,000 of those, scattered across all countries using the Roman alphabet.

During the first two years, we looked for lists deliberately, feeding Google queries such as “résultats baccalauréat Mali 2019” or “list attorneys Tasmania”. While this approach has brought us several hundred million records, it is showing its limits for two reasons. First, it breeds homogeneity, because it rests on the assumption that all countries have the same types of lists. Nothing could be further from the truth: Mexican states don’t issue exam results, but they publish lists of children receiving a hot breakfast at school. Second, by using the same queries over and over again, we miss the tail-end results with a low page rank.

So we’re now searching very differently, by combining a last name, a date and a file type. The query “Gonzalez” “23 10 1987” filetype:xls will only return spreadsheets containing that particular date and name. Chances are they contain lots of other names with as many dates. To search for African lists, all you have to do is replace “Gonzalez” with “Diallo” or “Traore”.
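To make the idea concrete, here is a minimal sketch of how such queries could be generated in bulk. The function names `build_query` and `build_queries` are hypothetical, and the day-month-year layout simply mirrors the example above; real lists may write dates in other ways.

```python
from datetime import date
from itertools import product


def build_query(surname: str, birth_date: date, filetype: str) -> str:
    """Combine a last name, a date and a file type into one search query.

    Assumption: the date is written as "DD MM YYYY", as in the example above.
    """
    date_text = f"{birth_date.day:02d} {birth_date.month:02d} {birth_date.year}"
    return f'"{surname}" "{date_text}" filetype:{filetype}'


def build_queries(surnames, dates, filetypes):
    """Return one query per (surname, date, file type) combination."""
    return [build_query(s, d, f) for s, d, f in product(surnames, dates, filetypes)]


if __name__ == "__main__":
    queries = build_queries(
        ["Gonzalez", "Diallo", "Traore"],
        [date(1987, 10, 23)],
        ["xls", "pdf"],
    )
    for query in queries:
        print(query)
```

Swapping the surname list is enough to steer the search toward another region, which is the point of the approach.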

This new approach has already yielded spectacular results. We’re trying to go one step further by fully automating the search process. A bot called Roamy will run thousands of queries, analyze the results (does this PDF contain dates? First and last names? Does it have a minimum number of records?) and select the best candidates for human curation. Very soon, we’ll know which domains or subdomains yield the best lists and which ones should be ignored.
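The checks listed above could look roughly like the sketch below. It is not Roamy's actual code: the date pattern, the "two capitalized words" test for names, the `min_records` threshold and all function names are assumptions chosen for illustration, and the input is assumed to be text already extracted from a PDF or spreadsheet.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed date shape: day, month, four-digit year, separated by space, "/", "." or "-".
DATE_RE = re.compile(r"\b\d{1,2}[ /.-]\d{1,2}[ /.-]\d{4}\b")
# Alphabetic words (any script), optionally joined by an apostrophe or hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def looks_like_name(line: str) -> bool:
    """True if two capitalized words follow each other in the line's word list."""
    words = WORD_RE.findall(line)
    return any(a[0].isupper() and b[0].isupper() for a, b in zip(words, words[1:]))


def count_records(text: str) -> int:
    """Count lines that contain both a date and something name-like."""
    return sum(
        1
        for line in text.splitlines()
        if DATE_RE.search(line) and looks_like_name(line)
    )


def is_candidate(text: str, min_records: int = 100) -> bool:
    """Keep a file for human curation only if it has enough records."""
    return count_records(text) >= min_records


def tally_domains(accepted_urls):
    """Count accepted files per host, to see which domains yield good lists."""
    return Counter(urlparse(url).netloc for url in accepted_urls)
```

A heuristic this simple will let through false positives, which is why the selected files still go to a human curator.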

Mediapark, our web agency, is currently developing Roamy’s prototype. We can count on Bright Data’s technical help. We’ll share screenshots soon.
