Scraping a Wikipedia table using Python

A colleague of mine tests a product that helps big brands target and engage Hispanic customers in the US. As you can imagine, they use a lot of survey data as well as openly available data to build the analytics in their product. We do test the accuracy of the underlying algorithms. But the algorithms used are only going to be as good the underlying data itself. So, a big and often ignored challenge for testing such applications is the verifying the validity of the underlying data itself.

I set out to see if I could help in this regard. We could make sure the Hispanic and Latino population data displayed in the application was somewhat similar to the most recent census data available. This page on Wikipedia had the data I wanted. The rest of this post outlines the two methods I used to scrape the Wikipedia table using Python.


Why this post?

As testers, we sometimes need real (or realistic) data for testing. But it is really hard to find data in the format you need. Often, we need to go to the websites such as Wikipedia , extract the data and then do the formatting we need. Wikipedia tables have proven to be an excellent source of data for us. So I thought I would share a couple of methods to scrape Wikipedia tables using Python.


Overview

Generally speaking, there are two basic tasks for scraping table data from a web page. First, get the HTML source. Second, parse the HTML to locate the table data. The key to scraping is looking at the HTML, understanding the page structure and figuring out how you want to pull the data.

We are going to scrape the table data (image below) from the Wikipedia web page for Hispanic and Latino population in USA. This data includes States/Territory, population growth percentage details etc.,

List of U.S. states by Hispanic and Latino population


Now that we know where our data is, we can start our coding. In this post I will show two methods to scrape the data:
a) Method 1: Use Wikipedia module and BeautifulSoup
b) Method 2: Use Pandas library

Note: If you have the option, I recommend the Pandas library because I found it magical!


Method 1: Use Wikipedia module and BeautifulSoup

Wikipedia module has functions for searching Wikipedia, getting article summaries, getting data like links and images from a page, and more. However, we couldn’t find a way to scrape table data using Wikipedia library. That’s why we used the Wikipedia module in combination with BeautifulSoup.

We used the Wikipedia library to search and get HTML source. We used Beautifulsoup to parse the HTML table. You will notice one difference between our implementation and other online solution: our implementation treats the table as a tree rather than as an orderly structure of tr and td elements.

import wikipedia, re
from BeautifulSoup import BeautifulSoup
 
def get_hispanic_population_data():
    "Get the details of hispanic and latino population by state/territory"
    wiki_search_string = "hispanic and latino population"
    wiki_page_title = "List of U.S. states by Hispanic and Latino population"
    wiki_table_caption = "Hispanic and Latino Population by state or territory"
    parsed_table_data = []
 
    search_results = wikipedia.search(wiki_search_string)
    for result in search_results:
        if wiki_page_title in result:
            my_page = wikipedia.page(result)
            #download the HTML source
            soup = BeautifulSoup(my_page.html())
            #Using a simple regex to do 'caption contains string'
            table = 
            soup.find('caption',text=re.compile(r'%s'%wiki_table_caption)).findParent('table')
            rows = table.findAll('tr')
            for row in rows:
                children = row.findChildren(recursive=False)
                row_text = []
                for child in children:
                    clean_text = child.text
                    #This is to discard reference/citation links
                    clean_text = clean_text.split('[')[0]
                    #This is to clean the header row of the sort icons
                    clean_text = clean_text.split(' ')[-1]
                    clean_text = clean_text.strip()
                    row_text.append(clean_text)
                parsed_table_data.append(row_text)
 
   return parsed_table_data
 
#----START OF SCRIPT
if __name__=="__main__":
    print ('Hispanic and Latino population data in the USA looks like this:\n\n')
    hispanic_population_data = get_hispanic_population_data() 
    for row in hispanic_population_data:
        print ('|'.join(row))

Let’s take a look at the code to see how this all works:

Step1: Get the HTML source
We created a function and set up wiki_search_string, wiki_page_title, wiki_table_caption variables. By using wikipedia.page() method, we pull the HTML source based on the page title. The function returns the HTML of the page in the my_page variable.

Step2: Identify the table
Next, we pass this HTML to BeautifulSoup which turns it into a well-formatted DOM object. We are trying to extract table information about Hispanic and Latino Population details in the USA. With the help of BeautifulSoup’s find() command and a simple regex, we identify the right table based on the table’s caption.

Step3: Extract the table data
Now that we identified the table that we need, we need to parse this table. We pull all the tr elements from the table. We use BeautifulSoup’s findChildren(recursive=False) method to find the immediate children of each row. Note that we do not bother identifying the columns using the td. This is because the table is somewhat unstructured and contains images, links, line breaks and th elements sprinkled in a few different places. In the above code, we have done some data cleaning to clean up few citation links and icons.

Great! Once you execute this code you should see below output –

Data Scraping - Output


Method 2: Use Pandas library

I suspected that there might be an easier way to do this sort of thing. And my instinct was right. The other way is, using Pandas’ read_html(). You can read HTML tables into a list of DataFrame objects. It finds the table element, does the parsing and creates a DataFrame. This function searches for table elements and only for tr and th rows and td elements within each tr or th element in the table. Below is a sample code to pass the HTML to pd.read_html().

import pandas as pd
import wikipedia as wp
 
#Get the html source
html = wp.page("List of U.S. states by Hispanic and Latino population").html().encode("UTF-8")
df = pd.read_html(html)[0]
df.to_csv('beautifulsoup_pandas.csv',header=0,index=False)
print (df)

The output is –

Data Scraping using pandas

After you call this function, you notice Pandas automatically fills in NaNs (Not a Number) for empty cells. I came away really impressed with how easy Pandas made parsing a relatively hard-to-parse table.

Note: This data may not be tidy enough or in the format needed by your project. You need to make it tidy before using it. You can refer to my other post Cleaning data with Python for some tips and tricks on cleaning data and restructuring the data.

If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.


References:

I hope you have found this post useful for web scraping with Python. Here are few references I found useful:
1. Data Scraping: Good article explaining about how to get data from the web, Scraping websites, tools that help to scrape.
2. Using Pandas for Data scraping
3. Wikipedia Table data Scraping with Python and BeautifulSoupThis article shows you another way to use BeautifulSoup to scrape Wikipedia table data.


Subscribe to our weekly Newsletter


View a sample



Indira Nellutla

I am an engineer with about 10 years of experience in software development and testing. Prior to joining Qxf2, I worked with reputed companies like Dell and Infosys. They helped me gain good experience in the manufacturing and healthcare domains. My QA career started at Dell. I got exposure to various testing tools, processes and methodologies, got an opportunity to work on various platforms like JMS Hermes, SOAP UI, data integration, queues, etc. I chose Qxf2 because it allowed remote working. My interests are vegetable gardening using organic methods, listening to music and reading books.

3 Comments

  1. manu said:

    Hi can you give me a clue how to extract these types of data from a result sheet. myself manu working in pvt college i have to do result analysis for the college. taught of programming it. can you please help in this regard

    Register No. 15BGS85130 Name :ABHISHEK K ! !
    Paper Code : SDC6SB SM1C61 SM1C62 SM1P61 SM1P62 SP1C61 SP1C62 SP1P61 SP1P62 SS2C61 SS2C62 SS2P61 SS2P62 ! !
    Total Marks: 25 7 0 A A 5 4 A A 0 3 A A ! 44 !
    I.A.Marks : 15 10 2 0 0 15 15 8 8 4 3 0 0 ! 80 !RE-
    ! !APPEAR
    ———————————————————————————————————————————-
    Register No. 15BGS85131 Name :ANEESH R FIRST CLASS EXEMPLARY ! !
    Paper Code : SDC6SB SM1C61 SM1C62 SM1P61 SM1P62 SP1C61 SP1C62 SP1P61 SP1P62 SS2C61 SS2C62 SS2P61 SS2P62 ! !164
    Total Marks: 45 56 41 35 31 65 55 32 32 43 37 34 34 !540 !8.2GPA
    I.A.Marks : 25 26 25 12 15 25 24 14 14 24 26 15 15 !260 !A+
    ! !GRADE
    ———————————————————————————————————————————-
    Register No. 15BGS85132 Name :ARCHANA K FIRST CLASS EXEMPLARY ! !
    Paper Code : SDC6SB SM1C61 SM1C62 SM1P61 SM1P62 SP1C61 SP1C62 SP1P61 SP1P62 SS2C61 SS2C62 SS2P61 SS2P62 ! !167
    Total Marks: 43 61 59 35 35 34 46 35 35 52 40 35 35 !545 !8.35GPA
    I.A.Marks : 25 29 22 15 15 26 26 15 15 30 30 15 15 !278 !A+
    ! !GRADE
    ———————————————————————————————————————————-
    Register No. 15BGS85133 Name :KIRAN U R ! !
    Paper Code : SDC6SB SM1C61 SM1C62 SM1P61 SM1P62 SP1C61 SP1C62 SP1P61 SP1P62 SS2C61 SS2C62 SS2P61 SS2P62 ! !
    Total Marks: 44 12 5 34 31 8 * 35 35 34 41 33 33 ! !
    I.A.Marks : 27 21 21 15 14 23 23 15 15 20 19 15 15 ! !N.P.
    ! !

    my email id is manu12384@gmail.com

    August 22, 2018
    Reply
  2. Isabaala said:

    Hello,
    I’m new with python and I’m trying to extract some information about currencies but in Portuguese and Spanish wikipedia pages, to try to use for it for text mining. I tried what you did but I don’t have the right page.

    October 24, 2019
    Reply
    • Rohan Joshi Rohan Joshi said:

      Hi Isabaala ,

      Can you provide us the links of the pages you are trying to scrape and what output are you currently getting

      October 24, 2019
      Reply

Leave a Reply

Your email address will not be published.