The logic behind Roamy
The Population Project has embarked on the creation of a list-searching bot called Roamy.
Antoine Bello · November 7, 2023
Over the past three years, we have become pretty skilled at finding good lists on the internet (good lists are long and contain dates or at least years of birth). They come in various formats, PDF and XLS being the most common. We have about 60,000 of those, scattered across all countries using the Roman alphabet.
During the first two years, we would look for lists in a deliberate fashion, feeding Google such queries as “résultats baccalauréat Mali 2019” or “list attorneys Tasmania”. While this approach has brought us several hundred million records, it’s showing its limits for two reasons. First, it begets homogeneity, as it’s based on the assumption that all countries have the same types of lists. Nothing could be further from the truth. Mexican states don’t issue exam results, but they publish the lists of children receiving a hot breakfast at school. Second, by using the same queries over and over again, we miss the tail-end results with a low page rank.
So we’re now searching very differently, by combining a last name, a date and a file type. “Gonzalez” “23 10 1987” filetype:xls will only return spreadsheets containing that particular date and name. Chances are they contain lots of other names with as many dates. To search for African lists, all you have to do is replace “Gonzalez” with “Diallo” or “Traore”.
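To make the query pattern concrete, here is a minimal sketch in Python of how such search strings could be generated. The function names (build_query, build_queries) are hypothetical and are not taken from Roamy itself; the only thing borrowed from the text is the query shape: a quoted last name, a quoted date written as day, month and year, and a filetype: filter.

```python
from datetime import date
from itertools import product


def build_query(last_name: str, day: date, filetype: str) -> str:
    """Return one query: quoted last name, quoted 'DD MM YYYY' date, file type filter."""
    date_text = f"{day.day:02d} {day.month:02d} {day.year}"
    return f'"{last_name}" "{date_text}" filetype:{filetype}'


def build_queries(last_names, days, filetypes):
    """Return every combination of last name, date and file type as a query string."""
    return [
        build_query(name, day, filetype)
        for name, day, filetype in product(last_names, days, filetypes)
    ]


if __name__ == "__main__":
    # Illustrative inputs only: two last names, one date, two file types.
    for query in build_queries(
        ["Gonzalez", "Diallo"], [date(1987, 10, 23)], ["xls", "pdf"]
    ):
        print(query)
```

Swapping the list of last names is all it takes to point the same queries at another region.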
This new approach has already yielded spectacular results. We’re trying to go one step further by fully automating the search process. A bot called Roamy will run thousands of queries, analyze the results (does this PDF contain dates? First and last names? Does it have a minimum number of records?) and select the best candidates for human curation. Very soon, we’ll know which domains or subdomains yield the best lists and which ones should be ignored.
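As an illustration of the kind of checks listed above, here is a rough sketch in Python that works on text already extracted from a file. It is not Roamy’s actual logic: the date pattern, the "at least two words" stand-in for first and last names, the threshold of 100 records and the function names are all assumptions made for the example.

```python
import re

# Assumed date shape: day, month and four-digit year separated by space, dot, slash or dash.
DATE_RE = re.compile(r"\b(\d{1,2})[ ./-](\d{1,2})[ ./-](\d{4})\b")
# Runs of two or more letters (accented letters included), a crude proxy for names.
WORD_RE = re.compile(r"[^\W\d_]{2,}")


def has_plausible_date(line: str) -> bool:
    """True if the line contains a day 1-31 and month 1-12 followed by a year."""
    for match in DATE_RE.finditer(line):
        day, month = int(match.group(1)), int(match.group(2))
        if 1 <= day <= 31 and 1 <= month <= 12:
            return True
    return False


def count_records(text: str) -> int:
    """Count lines that hold a plausible date and at least two name-like words."""
    return sum(
        1
        for line in text.splitlines()
        if has_plausible_date(line) and len(WORD_RE.findall(line)) >= 2
    )


def is_candidate(text: str, min_records: int = 100) -> bool:
    """Keep a file for human curation only if it has enough record-like lines."""
    return count_records(text) >= min_records
```

A filter like this would only narrow the field; the final call on each list stays with a human curator.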
Mediapark, our web agency, is currently developing Roamy’s prototype. We can count on Bright Data’s technical help. We’ll share screenshots soon.