A day at the Population Project: Brazilian names
Antoine Bello · February 9, 2023
Among the many difficulties we face at the Population Project, none carries higher stakes than names. We’re not talking about the complexity induced by multiple alphabets or the relative scarcity of Korean names, which makes for an inordinate number of homonyms. Those are future sources of worry, as we’re only tackling Roman-alphabet countries for the moment. Yet who would have thought Brazilian full names would pose so many problems?
Brazilians typically have two first names and two last names, some of which can be compound, as in “Guilherme João Paulo Villas Boas Oliveira dos Santos”. To make matters worse, the lists we find are often stripped of special characters such as hyphens or tildes. And out of the 150 million Brazilian full names we have processed, none came already broken down into first and last names, as is customary in other Latin American countries.
This leaves us with a choice: store the full name in one string (not ideal from a search perspective) or try to break it down into two or four elements. We chose the latter, only to discover it was more challenging than it looked, as scores of Brazilian names can be both first and last names! When in doubt, we rely on frequency tables. For instance, “Braga” is 2,000 times more likely to be a last name than a first name, while “Juvita” is 100 times more likely to be a first name. If the ratio is less than 10:1, we skip the record altogether. We ALWAYS err on the side of caution.
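To make the 10:1 rule concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our actual pipeline: the `FREQ` table, its counts (including the entry for “Franco”) and the `classify` function are all hypothetical, chosen only to mirror the ratios quoted above.

```python
from typing import Optional

# Hypothetical frequency table (made-up counts, for illustration only):
# name -> (times seen as a first name, times seen as a last name)
FREQ = {
    "braga": (1, 2000),   # mirrors the 2,000:1 last-name ratio
    "juvita": (100, 1),   # mirrors the 100:1 first-name ratio
    "franco": (3, 5),     # assumed ambiguous example, below 10:1
}

MIN_RATIO = 10  # below this ratio, the name is considered too ambiguous


def classify(token: str) -> Optional[str]:
    """Return "first", "last", or None when the name is unknown or ambiguous."""
    first, last = FREQ.get(token.lower(), (0, 0))
    if first == 0 and last == 0:
        return None  # never seen: no basis for a decision
    if first >= MIN_RATIO * last:
        return "first"
    if last >= MIN_RATIO * first:
        return "last"
    return None  # ratio under 10:1, so the record would be skipped


for name in ("Braga", "Juvita", "Franco"):
    print(name, classify(name))
```

In this sketch, a `None` result stands for “skip the record”: any full name containing such a token would be set aside rather than split on a guess.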
And of course Brazil, like all countries, has its outliers: first names such as “Al Capone”, “Michael Jackson” or “Lady Di” that should in no way be separated. Well, at the risk of disappointing, expect a few mistakes on that front.