Scraping Handbook

The Population Project

AB

Antoine Bello · May 16, 2021

So you’ve never scraped data before…

You might never have scraped data on the web in an organized fashion. Don’t worry, we’ll show you the basic tools and techniques needed to retrieve large quantities of names and store them in a spreadsheet such as Excel, Google Sheets or Numbers.

The Population Project has high standards regarding the quality of data we incorporate in our database. But it doesn’t mean we expect all our volunteers to come up with perfectly clean files. We are grateful for anything you can do and whatever task you haven’t been able to complete on your own will be forwarded to another, more specialized, volunteer.

This handbook is meant to show you the different steps that go into making a perfect file. If you don’t feel comfortable going past step 1 or 2, that’s fine. You will hopefully realize though that the following steps are not that much more difficult to master.

Step 1: Pointing us to an interesting source of information

Let’s say you’re the first Population Project’s volunteer in the Philippines. You’re interested in sourcing new lists but not so much in moving the data to Excel. In that case, when you find a list of new graduates from a state college, sent us an email to list@thepopulationproject.org, with the following:

We’ll take care of the rest.

Based on our experience, most people prefer to chase sources while a select few prefer exploiting them methodically. This is perfectly fine: both approaches are complementary. Besides, there’s enough work for everybody!

Step 2: Downloading a list or a document

Sometimes, the list or the document in question will be ready to export, as in this example (mind the “Export rankings” button on the right).

Export list example

Or the list is in itself an Excel or PDF document and can be downloaded, as in this example:

Download list example

Step 3: Copying and pasting

Granted, copy-paste is very 20th century but it works! Just position the cursor next to the first character you want to capture and drag it all the way to the bottom corner. Be sure to always copy full rows of data (this sometimes forces you to drag to the beginning of the next line).

Copy paste list example

Then paste into your spreadsheet.

Step 4: Scraping

Copy-paste will only get you so far. Long lists can contain hundreds of pages. In that case, we recommend you move to scraping, a perfectly legal process that speeds up the collection of data. The Population Project uses four or five different scrapers, some of them pretty sophisticated. Since we don’t want to turn this handbook into a technical manual, we will only talk about the easiest one to use.

Instant Data Scraper is a Chrome Extension that be downloaded here. Once it is installed in Chrome, it appears in the top right corner or your browser, looking like this:

Data Scraper icon

Let’s say you’re on a site that holds 30,000 athletes profiles. Each page returns 100 profiles. Copying and pasting 300 pages is out of the question. Instead, you click on the Instant Data Scraper button.

Immediately, the table with the first 100 players turns yellow, meaning it has been recognized by the extension. Another window opens automatically and shows you all the data organized in columns. You can delete the columns you don’t need or revert to the original columns (if you deleted one you shouldn’t have). You can even switch tables if there are more than one on the page.

Data Scraper example

See the three green buttons in the scraper? They let you save the data as a csv, xlsx (Excel) file or copy it all in your clipboard. That’s nice but we want more than just this one page, we want all 300! So click on the wide blue button “Locate “Next” button”. There is such a button under almost all multi-page tables. Once you click on it, it becomes green, meaning it has been recognized.

Data Scraper pagination example

The scraper now has everything it needs: the table and the columns you want, plus the button to move from each page to the next. You can now click “Start crawling” and let the scraper work its magic. In a matter of minutes (seconds sometimes), your data will be ready to be saved to Excel.

Just to be sure, Instant Data Scraper doesn’t work on all sites. In that case, we resort to more sophisticated software (which sometimes don’t work either). If you can’t get Instant Data Scraper to work on one list, go back to Step 1 and point us to the source. We’ll do our best to extract the precious data.

One last thing: lots of lists come in PDF format and require a different scraping approach. We use a free software called Tabula. Tabula is easy to install (both on Windows and Mac) and pretty intuitive to use. Give it a try if you feel like it. If you don’t, just send your PDFs our way.

Step 5: Cleaning your Excel file

Collecting the data is sometimes the easy part. Cleaning the Excel file to make it processable by our database can prove tricky and extremely time-consuming. Here is what you need to know:

  • If your list includes the sex, we only accept the values “MALE/FEMALE, “M/F” or “1/2”. So if your spreadsheet is in the form of “Boy/Girl”, please change the “Boy” into “MALE” and the “Girl” into “FEMALE”. See next section’s tips to do this efficiently.

  • In our database, all names are stored in three fields: first name, middle name, last name. Most lists however come with all three names bundled together, as in “Luisa Maria Cabrera”. If your list looks like this, leave it to us: our algorithm will make the best of the situation. But if your names look like “Luisa Maria CABRERA”, it's fair to assume that CABRERA is the last name. Go to the next section’s tip “Separate lowercase from uppercase” to learn how to break down one full name column into two.

  • We take dates in the US format. If your dates of birth are in the form of 14/3/1997, please change the format of that column to 3/14/1997 (Format > Cells > Date).

  • We love graduation lists because even if they don’t contain precise dates of birth, they still give us precious information about the year of birth. Someone graduating from high school is the US is likely 18 and almost certainly no younger than 17 and no older than 19. If you encounter that situation, create two additional columns: one for the tentative year of birth (2003 for someone graduating from high school in 2021) and another for what we call the confidence margin, in that case one year. In our jargon, confidence margin means that we are 95% certain that the humans on the list are between 17 and 19. If you feel that 1 year is too narrow an interval, make it two years. Your page would then look like this.

Sex First name Last name Year of birth Confidence margin
FEMALE Carolyn Ball 2002 2
FEMALE Jackie Bayes 2002 2
MALE Thomas Beck 2002 2
MALE Kyle Baum 2002 2
  • This notion of confidence margin can be applied in lots of circumstances. Here are a few examples.
Age Confidence margin
Swimmer in the under-16 category 15 1
Marathon runner in the 30-40 category 35 5
College graduate in the US 23 2

Please note that ages of graduation are highly specific to a country. If in doubt, err on the side of caution.

  • Lots of lists include countries. But countries go by lots of different names. According to the context, “Germany”, “Deutschland”, “GER”, “DEU”, “DE” all refer to the same country. We have chosen one standard, called ISO-Alpha-3. All names are identified by a combination of three characters, some pretty straightforward (“USA”, “FRA” for “France”, “ARG” for Argentina), some less so (“DZA” for “Algeria”, “ZAF” for South Africa, etc.). The list of abbreviations can be found here.
  • You will often encounter another widely-used 3-letter system, used by FIFA and the International Olympic Committee. In the next section, you’ll find a way to easily convert codes from one system to the other (never do it manually!).

Step 6: Turning in your files

We’re working on a volunteering portal that will let you upload your files easily. In the meantime, please send them to volunteers@thepopulationproject.org.

Give your files distinguishable names, such as ARG_Tennis_Federation_2021. You can also make our life much easier by including basic information at the top of your spreadsheet, such as your country, the source of the information, and the URL (see below).

Excel metadata

Last, to ensure maximum compatibility of accents and special characters, save your file as CSV UTF-8.

Pro Tips

Copy a cell all the way to the bottom of a column

You have an entire column of “MAN” that you would like to transform into “MALE”. You may of course change the first cell, copy it, and then drag all the way to the bottom of the column. But with a large file, this can become tedious and prone to mistakes if you release the pressure one second too early or too late.

The tip: double click on the bottom right corner of the first cell for its content to automatically copy all the way down. Just make sure that your list doesn’t contain any empty row for the copying will stop at the first empty cell.

Separate the last few characters of a string

A string is a chain of characters. Say your cells are in the form “Born in 1987”. You would like to separate the year from the rest of the string. In this case, use the formula RIGHT. If “Born in 1987” is in cell A1, writing from any cell =RIGHT(A1,4) will return 1987. LEFT(A1,4) would have returned, you guessed it, “Born”.

Using LEFT and RIGHT sometimes requires knowing the length of the string first. To obtain the length of “Born in 1987”, just enter =LEN(A1). The answer will be 12.

And, of course, we can combine both formulas. Say all your cells start with an unnecessary space. In that case, create another column, where the value of B1 is =RIGHT(A1,LEN(A1)-1). In other words, keep all the characters to the right of the string, minus one.

Text to Columns

After a successful scraping, you’re left with thousands of names in the form of “Müller, Rudi”. Of course, you could calculate for each name the position of the comma (using the FIND function) and then use LEFT to retrieve the last name and RIGHT to retrieve the first name. There is a faster solution though.

Select the entire column and go to Data > Text to Columns. Choose the first option (Delimited), click “Next” and indicate the character you would like to use to separate your text. By default, Excel offers tab, comma, semi-column and space but you can choose any character, including an opening or a closing parenthesis. In the “Müller, Rudi” example, we choose the comma.

Click "Next" and "Finish" and your column is now separated in two (or three or four if there are more than one instance of your pivot-character in the string.

Be careful, the new column(s) will appear automatically to the right of the old one. So make sure to give it some space first.

Separate uppercase from lowercase

The following formulas are especially useful if you want to break down strings such as “Kenji YAMAGUCHI” or “MALDINI Fabio”. They’re pretty complicated but they work!

1. Finds the first lowercase letter in a string in A1:

=MATCH(1,IF(ABS(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))-109.5)<=12.5,1),0)

The formula will tell you that the first lowercase letter of "MALDINI Fabio" is in position 10 ("a"). That way, you can separate MALDINI by taking the first 7 characters (10 minus 3). Fabio comes by taking all characters as of character 9 (10 minus 1).

2. Finds the last lowercase letter in a string in A1:

=MAX(ROW(INDIRECT("1:"&LEN(A1)))\*(ABS(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))-77.5)<=12.5))

Use to separate "Fabio MALDINI". The last lowercase letter is in position 5.

3. Finds the first uppercase letter in a string in A1:

=MIN(IF(ISERROR(FIND(CHAR(ROW(INDIRECT("65:90"))),A1)),"",FIND(CHAR(ROW(INDIRECT("65:90"))),A1)))

Use to separate "fabio MALDINI". The first uppercase letter is in position 7.

4. Finds the last uppercase letter in a string in A1:

=MAX(ROW(INDIRECT("1:"&LEN(A1)))\*(ABS(CODE(MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1))-77.5)<=12.5))

Use to separate "MALDINI fabio". The last uppercase letter is in position 7.

VLOOKUP: how to easily convert country codes

You have an entire column of country codes. Unfortunately, they’re the wrong type… Germany is listed as “GER” (instead of “DEU”), the Netherlands as “NED” (instead of “NLD”), etc. Fortunately, you can correct this almost instantly by using a powerful Excel function called VLOOKUP. Go to this page and copy the main table anywhere in your spreadsheet. Delete all the columns but “A-3”) and “FIFA”. You then move the “FIFA” column to the left of the “A-3” column (i.e. the alpha-3 ISO, which is the Population Project’s standard.

Then insert a column next to the one hosting your faulty country codes. Your spreadsheet should look like this. Now in cell C2, you are going to write: =VLOOKUP(B2,$F$2:$G$260,2,FALSE)

Name IOC Code New code FIFA A-3
Aadya VARIYATH GER AFG AFG
Aaron BAI USA ALD ALA
Aaron CHEN NZL ALB ALB
Aarya Gharge NED ALG DZA

What is does is take the value in B2 (“GER”, find it in the first column of the table FIFA-A3 and return the corresponding value in the second column (hence the “2” in the formula) of that table: “DEU”, leading to this:

Name IOC Code New code FIFA A-3
Aadya VARIYATH GER DEU AFG AFG
Aaron BAI USA USA ALD ALA
Aaron CHEN NZL NZL ALB ALB
Aarya Gharge NED NLD ALG DZA

Once you have converted all the codes into ISO-3 ones, you might be tempted to delete the column with the old codes. Be careful: because the new codes have been deducted from the old ones, deleting that column would cause you to lose all your country codes. To avert this, select your good column and paste it on itself using Edit > Paste Special > Values. That way, you copy the values in the cells and lose all reference to how they were calculated in the first place.