{"id":5999,"date":"2017-06-12T23:56:21","date_gmt":"2017-06-13T03:56:21","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=5999"},"modified":"2018-04-02T10:39:34","modified_gmt":"2018-04-02T14:39:34","slug":"web-scraping-using-python","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/web-scraping-using-python\/","title":{"rendered":"Scraping a Wikipedia table using Python"},"content":{"rendered":"<p>A colleague of mine tests a product that helps big brands target and engage Hispanic customers in the US. As you can imagine, they use a lot of survey data as well as openly available data to build the analytics in their product. We do test the accuracy of the underlying algorithms. But the algorithms used are only going to be as good as the underlying data itself. So, a big and often ignored challenge in testing such applications is verifying the validity of that underlying data. <\/p>\n<p>I set out to see if I could help in this regard. We could make sure the Hispanic and Latino population data displayed in the application was somewhat similar to the most recent census data available. <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_U.S._states_by_Hispanic_and_Latino_population\">This page<\/a> on Wikipedia had the data I wanted. The rest of this post outlines the two methods I used to scrape the Wikipedia table using Python. <\/p>\n<hr>\n<h3>Why this post?<\/h3>\n<p>As testers, we sometimes need real (or realistic) data for testing. But it is really hard to find data in the format you need. Often, we need to go to websites such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Main_Page\">Wikipedia<\/a>, extract the data and then do the formatting we need. Wikipedia tables have proven to be an excellent source of data for <a href=\"https:\/\/www.qxf2.com\/?utm_source=wiki_table&#038;utm_medium=click&#038;utm_campaign=From%2520blog\">us<\/a>. So I thought I would share a couple of methods to scrape Wikipedia tables using Python. 
<\/p>\n<hr>\n<h3>Overview<\/h3>\n<p>Generally speaking, there are two basic tasks for scraping table data from a web page. First, get the HTML source. Second, parse the HTML to locate the table data. The key to scraping is looking at the HTML, understanding the page structure and figuring out how you want to pull the data.<\/p>\n<p>We are going to scrape the table data (image below) from the Wikipedia web page for <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_U.S._states_by_Hispanic_and_Latino_population\">Hispanic and Latino population in the USA<\/a>. This data includes the state\/territory, population growth percentage details, etc. <\/p>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/datascraping_lists.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6029\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/datascraping_lists.png\" alt=\"List of U.S. states by Hispanic and Latino population\" width=\"651\" height=\"332\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/datascraping_lists.png 651w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/datascraping_lists-300x153.png 300w\" sizes=\"auto, (max-width: 651px) 100vw, 651px\" \/><\/a><\/p>\n<hr>\n<p>Now that we know where our data is, we can start coding. 
In this post I will show two methods to scrape the data:<br \/>\na) Method 1: Use <a href=\"https:\/\/pypi.python.org\/pypi\/wikipedia\/\">Wikipedia module<\/a> and <a href=\"https:\/\/pypi.python.org\/pypi\/BeautifulSoup\">BeautifulSoup<\/a><br \/>\nb) Method 2: Use <a href=\"https:\/\/pypi.python.org\/pypi\/pandas\">Pandas<\/a> library<\/p>\n<p><strong>Note:<\/strong> If you have the option, I recommend the Pandas library because I found it magical!<\/p>\n<hr>\n<h3>Method 1: Use Wikipedia module and BeautifulSoup<\/h3>\n<p>The <a href=\"https:\/\/pypi.python.org\/pypi\/wikipedia\/\">Wikipedia module<\/a> has functions for searching Wikipedia, getting article summaries, getting data like links and images from a page, and more. However, we couldn&#8217;t find a way to scrape table data using the Wikipedia library alone. That\u2019s why we used the Wikipedia module in combination with BeautifulSoup. <\/p>\n<p>We used the Wikipedia library to search for the page and get its HTML source. We used BeautifulSoup to parse the HTML table. You will notice one difference between our implementation and other online solutions: our implementation treats the table as a tree rather than as an orderly structure of <code>tr<\/code> and <code>td<\/code> elements. <\/p>\n<pre lang=\"Python\">\r\nimport wikipedia, re\r\nfrom BeautifulSoup import BeautifulSoup\r\n\r\ndef get_hispanic_population_data():\r\n    \"Get the details of hispanic and latino population by state\/territory\"\r\n    wiki_search_string = \"hispanic and latino population\"\r\n    wiki_page_title = \"List of U.S. 
states by Hispanic and Latino population\"\r\n    wiki_table_caption = \"Hispanic and Latino Population by state or territory\"\r\n    parsed_table_data = []\r\n\r\n    search_results = wikipedia.search(wiki_search_string)\r\n    for result in search_results:\r\n        if wiki_page_title in result:\r\n            my_page = wikipedia.page(result)\r\n            #download the HTML source\r\n            soup = BeautifulSoup(my_page.html())\r\n            #Using a simple regex to do 'caption contains string'\r\n            table = soup.find('caption',text=re.compile(r'%s'%wiki_table_caption)).findParent('table')\r\n            rows = table.findAll('tr')\r\n            for row in rows:\r\n                children = row.findChildren(recursive=False)\r\n                row_text = []\r\n                for child in children:\r\n                    clean_text = child.text\r\n                    #This is to discard reference\/citation links\r\n                    clean_text = clean_text.split('&#091;')[0]\r\n                    #This is to clean the header row of the sort icons\r\n                    clean_text = clean_text.split('&#160;')[-1]\r\n                    clean_text = clean_text.strip()\r\n                    row_text.append(clean_text)\r\n                parsed_table_data.append(row_text)\r\n\r\n    return parsed_table_data\r\n            \r\n#----START OF SCRIPT\r\nif __name__==\"__main__\":\r\n    print ('Hispanic and Latino population data in the USA looks like this:\\n\\n')\r\n    hispanic_population_data = get_hispanic_population_data() \r\n    for row in hispanic_population_data:\r\n        print ('|'.join(row))\r\n<\/pre>\n<p>Let\u2019s take a look at the code to see how this all works:<\/p>\n<p><strong>Step1: Get the HTML source<\/strong><br \/>\nWe created a function and set up <code>wiki_search_string<\/code>, <code>wiki_page_title<\/code>, <code>wiki_table_caption<\/code> variables. 
Using the <code>wikipedia.search()<\/code> and <code>wikipedia.page()<\/code> methods, we look up the page by its title and store it in the <code>my_page<\/code> variable. Calling <code>my_page.html()<\/code> then returns the HTML source of the page.<\/p>\n<p><strong>Step2: Identify the table<\/strong><br \/>\nNext, we pass this HTML to BeautifulSoup, which turns it into a parse tree that we can navigate. We want the table with the Hispanic and Latino population details in the USA. With the help of BeautifulSoup\u2019s <code>find()<\/code> method and a simple regex, we identify the right table based on the table&#8217;s caption. <\/p>\n<p><strong>Step3: Extract the table data<\/strong><br \/>\nNow that we have identified the table we need, we can parse it. We pull all the <code>tr<\/code> elements from the table. We use BeautifulSoup&#8217;s <code>findChildren(recursive=False)<\/code> method to find the <u>immediate children<\/u> of each row. Note that we do not bother identifying the columns using the <code>td<\/code>. This is because the table is somewhat unstructured and contains images, links, line breaks and <code>th<\/code> elements sprinkled in a few different places. In the above code, we have also done some data cleaning to strip out citation links and sort icons.<\/p>\n<p>Great! 
Once you execute this code, you should see the output below:<\/p>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_output.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_output.png\" alt=\"Data Scraping - Output\" width=\"749\" height=\"212\" class=\"aligncenter size-full wp-image-6055\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_output.png 749w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_output-300x85.png 300w\" sizes=\"auto, (max-width: 749px) 100vw, 749px\" \/><\/a><\/p>\n<hr>\n<h3>Method 2: Use Pandas library<\/h3>\n<p>I suspected that there might be an easier way to do this sort of thing, and my instinct was right. The other way is to use Pandas&#8217; <code>read_html()<\/code>, which reads HTML tables into a list of DataFrame objects. It finds the table elements, does the parsing and creates a DataFrame for each table. This function searches for <code>table<\/code> elements and only for <code>tr<\/code> and <code>th<\/code> rows and <code>td<\/code> elements within each <code>tr<\/code> or <code>th<\/code> element in the table. Below is sample code that passes the HTML to <code>pd.read_html()<\/code> and takes the first table it finds.<\/p>\n<pre lang=\"Python\">\r\nimport pandas as pd\r\nimport wikipedia as wp\r\n   \r\n#Get the html source\r\nhtml = wp.page(\"List of U.S. 
states by Hispanic and Latino population\").html().encode(\"UTF-8\")\r\ndf = pd.read_html(html)[0]\r\ndf.to_csv('beautifulsoup_pandas.csv',header=0,index=False)\r\nprint (df)\r\n<\/pre>\n<p>The output is:<\/p>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_pandas.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_pandas.png\" alt=\"Data Scraping using pandas\" width=\"893\" height=\"222\" class=\"aligncenter size-full wp-image-6056\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_pandas.png 893w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_pandas-300x75.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/06\/datascraping_pandas-768x191.png 768w\" sizes=\"auto, (max-width: 893px) 100vw, 893px\" \/><\/a><\/p>\n<p>After you call this function, you will notice that Pandas automatically fills in NaN (Not a Number) for empty cells. I came away really impressed with how easy Pandas made parsing a relatively hard-to-parse table. <\/p>\n<p><strong>Note:<\/strong> This data may not be tidy enough or in the format needed by your project. You need to make it tidy before using it. You can refer to my other post <a href=\"https:\/\/qxf2.com\/blog\/cleaning-data-python-pandas\/\">Cleaning data with Python<\/a> for some tips and tricks on cleaning and restructuring data.<\/p>\n<p><strong>If you are a startup finding it hard to hire technical QA engineers, learn more <a href=\"https:\/\/qxf2.com\/blog\/about-qxf2\/\">about Qxf2 Services<\/a>.<\/strong><\/p>\n<hr>\n<h3>References:<\/h3>\n<p>I hope you have found this post useful for web scraping with Python. Here are a few references I found useful:<br \/>\n1. 
<a href=\"http:\/\/datajournalismhandbook.org\/1.0\/en\/getting_data_3.html\">Data Scraping<\/a>: Good article explaining about how to get data from the web, Scraping websites, tools that help to scrape.<br \/>\n2. <a href= \"https:\/\/stackoverflow.com\/questions\/36070768\/using-beautifulsoup-to-parse-table\">Using Pandas for Data scraping<\/a><br \/>\n3. <a href=\"https:\/\/themeparkanalysis.com\/2016\/12\/08\/getting-disney-ride-lists-from-wikipedia-using-python-and-beautifulsoup\/\">Wikipedia Table data Scraping with Python and BeautifulSoup<\/a>This article shows you another way to use BeautifulSoup to scrape Wikipedia table data.<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>A colleague of mine tests a product that helps big brands target and engage Hispanic customers in the US. As you can imagine, they use a lot of survey data as well as openly available data to build the analytics in their product. We do test the accuracy of the underlying algorithms. But the algorithms used are only going to [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[135,130,132,18],"tags":[],"class_list":["post-5999","post","type-post","status-publish","format-standard","hentry","category-data-scraping","category-machine-learning","category-pandas","category-python"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5999","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=5999"}],"version-history":[{"count":65,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5999\/revisions"}],"predecessor-version":[{"id":15584,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5999\/revisions\/15584"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\
/wp-json\/wp\/v2\/media?parent=5999"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=5999"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=5999"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}