A colleague of mine tests a product that helps big brands target and engage Hispanic customers in the US. As you can imagine, they use a lot of survey data as well as openly available data to build the analytics in their product. We do test the accuracy of the underlying algorithms. But the algorithms used are only going to be as good the underlying data itself. So, a big and often ignored challenge for testing such applications is the verifying the validity of the underlying data itself.
I set out to see if I could help in this regard. We could make sure the Hispanic and Latino population data displayed in the application was somewhat similar to the most recent census data available. This page on Wikipedia had the data I wanted. The rest of this post outlines the two methods I used to scrape the Wikipedia table using Python.
Why this post?
As testers, we sometimes need real (or realistic) data for testing. But it is really hard to find data in the format you need. Often, we need to go to the websites such as Wikipedia , extract the data and then do the formatting we need. Wikipedia tables have proven to be an excellent source of data for us. So I thought I would share a couple of methods to scrape Wikipedia tables using Python.
Generally speaking, there are two basic tasks for scraping table data from a web page. First, get the HTML source. Second, parse the HTML to locate the table data. The key to scraping is looking at the HTML, understanding the page structure and figuring out how you want to pull the data.
We are going to scrape the table data (image below) from the Wikipedia web page for Hispanic and Latino population in USA. This data includes States/Territory, population growth percentage details etc.,
Now that we know where our data is, we can start our coding. In this post I will show two methods to scrape the data:
a) Method 1: Use Wikipedia module and BeautifulSoup
b) Method 2: Use Pandas library
Note: If you have the option, I recommend the Pandas library because I found it magical!
Method 1: Use Wikipedia module and BeautifulSoup
Wikipedia module has functions for searching Wikipedia, getting article summaries, getting data like links and images from a page, and more. However, we couldn’t find a way to scrape table data using Wikipedia library. That’s why we used the Wikipedia module in combination with BeautifulSoup.
We used the Wikipedia library to search and get HTML source. We used Beautifulsoup to parse the HTML table. You will notice one difference between our implementation and other online solution: our implementation treats the table as a tree rather than as an orderly structure of
import wikipedia, re from BeautifulSoup import BeautifulSoup def get_hispanic_population_data(): "Get the details of hispanic and latino population by state/territory" wiki_search_string = "hispanic and latino population" wiki_page_title = "List of U.S. states by Hispanic and Latino population" wiki_table_caption = "Hispanic and Latino Population by state or territory" parsed_table_data =  search_results = wikipedia.search(wiki_search_string) for result in search_results: if wiki_page_title in result: my_page = wikipedia.page(result) #download the HTML source soup = BeautifulSoup(my_page.html()) #Using a simple regex to do 'caption contains string' table = soup.find('caption',text=re.compile(r'%s'%wiki_table_caption)).findParent('table') rows = table.findAll('tr') for row in rows: children = row.findChildren(recursive=False) row_text =  for child in children: clean_text = child.text #This is to discard reference/citation links clean_text = clean_text.split('[') #This is to clean the header row of the sort icons clean_text = clean_text.split(' ')[-1] clean_text = clean_text.strip() row_text.append(clean_text) parsed_table_data.append(row_text) return parsed_table_data #----START OF SCRIPT if __name__=="__main__": print ('Hispanic and Latino population data in the USA looks like this:\n\n') hispanic_population_data = get_hispanic_population_data() for row in hispanic_population_data: print ('|'.join(row))
Let’s take a look at the code to see how this all works:
Step1: Get the HTML source
We created a function and set up
wiki_table_caption variables. By using
wikipedia.page() method, we pull the HTML source based on the page title. The function returns the HTML of the page in the
Step2: Identify the table
Next, we pass this HTML to BeautifulSoup which turns it into a well-formatted DOM object. We are trying to extract table information about Hispanic and Latino Population details in the USA. With the help of BeautifulSoup’s
find() command and a simple regex, we identify the right table based on the table’s caption.
Step3: Extract the table data
Now that we identified the table that we need, we need to parse this table. We pull all the
tr elements from the table. We use BeautifulSoup’s
findChildren(recursive=False) method to find the immediate children of each row. Note that we do not bother identifying the columns using the
td. This is because the table is somewhat unstructured and contains images, links, line breaks and
th elements sprinkled in a few different places. In the above code, we have done some data cleaning to clean up few citation links and icons.
Great! Once you execute this code you should see below output –
Method 2: Use Pandas library
I suspected that there might be an easier way to do this sort of thing. And my instinct was right. The other way is, using Pandas’
read_html(). You can read HTML tables into a list of DataFrame objects. It finds the table element, does the parsing and creates a DataFrame. This function searches for
table elements and only for
th rows and
td elements within each
th element in the table. Below is a sample code to pass the HTML to pd.read_html().
import pandas as pd import wikipedia as wp #Get the html source html = wp.page("List of U.S. states by Hispanic and Latino population").html().encode("UTF-8") df = pd.read_html(html) df.to_csv('beautifulsoup_pandas.csv',header=0,index=False) print (df)
The output is –
After you call this function, you notice Pandas automatically fills in NaNs (Not a Number) for empty cells. I came away really impressed with how easy Pandas made parsing a relatively hard-to-parse table.
Note: This data may not be tidy enough or in the format needed by your project. You need to make it tidy before using it. You can refer to my other post Cleaning data with Python for some tips and tricks on cleaning data and restructuring the data.
If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.
I hope you have found this post useful for web scraping with Python. Here are few references I found useful:
1. Data Scraping: Good article explaining about how to get data from the web, Scraping websites, tools that help to scrape.
2. Using Pandas for Data scraping
3. Wikipedia Table data Scraping with Python and BeautifulSoupThis article shows you another way to use BeautifulSoup to scrape Wikipedia table data.