When testing highly data dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, it usually releases it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from PDF file using Python and write it into a CSV or Excel file.
We will take an example of US census data for the Hispanic Population for 2010. If you look at the content of the PDF, you can see that there is a lot of text data, table data, graphs, maps etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from Page 3 of the PDF file.
Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.
Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables
When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents. But it can extract text and return it as a Python string. Reading a PDF document is pretty simple and straight forward. I used PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.
import PyPDF2 PDFfilename = "hispanic.pdf" #filename of your PDF/directory where your PDF is stored pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object
Firstly, I installed PyPDF2 library and imported it, created an instance of the PdfFileReader Class, which stores information about the PDF (number of pages, text on pages, etc). In this PDF, the table which I need extract is in Page 3. To extract this page, I used below code:-
pg3 = pfr.getPage(2) #extract pg 2 writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object #add pages writer.addPage(pg3) #filename of your PDF/directory where you want your new PDF to be NewPDFfilename = "hispanic_tables.pdf" with open(NewPDFfilename, "wb") as outputStream: #create new PDF writer.write(outputStream) #write pages to new PDF
I used the .getPage() method, with the page number + 1 as the parameter (pages start at 0), on PdfFileReader object. After that, I created a PdfFileWriter object, which will eventually write a new PDF and add the pages to it. The purpose of writing this page with tables into separate pdf file is that I used PDFTables for extracting data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV, to avoid having a lot of junk data in the Excel I created a separate PDF with just the table that I want to extract.
PyPDF2 library extracts the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together)
Writing the Table Data to a Excel using PDFTables
Now that I have a PDF with all of the table data that I need, I can now use PDFTables to write the table data to an Excel/CSV file. The PDFTables package extracts tables from PDF files and allows the user to convert PDF tables to formats (CSV, XLM, or XLSX). It provides us with an API key using which we can post a request to the PDFTables website to get the table extraction. You can get an API key by creating an account on the site for a free trial (PDFtables.com is paid, getting an API Key is restricted to certain pages only). With this free trial, I was able to upload this pdf and write the response to an excel. This served my purpose, but since PDFTables.com is paid I moved on exploring other tools for data extraction.
Method 2: PDFMiner for extracting text data from PDFs
I came across a great Python-based solution to extract the text from a PDF is PDFMiner. PDFMiner has two command-line scripts namely pdf2txt.py (to extract text and images) and dumpdf.py (find objects and their coordinates). I used pdf2txt.py script to extract the pdf content to HTML format using below command.
pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf
Below is list of options which can be used with pdf2txt.py
- -o output file name
- -p comma-separated list of page numbers to extract
- -t output format (text/html/xml/tag[for Tagged PDFs])
- -O dirname (triggers extraction of images from PDF into directory)
- -P password
The above command can be used to convert a PDF to HTML or XML. After installing PDFMiner, cd into the directory where the PDF file is located and ran the above command. The resulting file will be ‘hispanic.html’ which has the 3rd page from the PDF. Reading data from HTML can be done using Beautiful Soup. It is a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup for reading and extracting the data from hispanic.html. You can refer to my previous post on Data scraping using python for extracting table data from html and writing into a csv file. I wrote a quick script that will extract table data from web page using Wikipedia module and BeautifulSoup.
In this way, I used PDFMiner and PyPDF2 to extract the data, but you’ll still have to make a choice when deciding which to use and learn. Both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use to get what you need from them, and Python enables to get inside and scrape, split, merge, delete, and crop just about whatever you find.
In this post, I tried to showcase different approaches with few code snippets which I implemented in our requirement for extracting table data from PDF file by providing. I hope you will like it!
If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.
I am an engineer with about 10 years of experience in software development and testing. Prior to joining Qxf2, I worked with reputed companies like Dell and Infosys. They helped me gain good experience in the manufacturing and healthcare domains. My QA career started at Dell. I got exposure to various testing tools, processes and methodologies, got an opportunity to work on various platforms like JMS Hermes, SOAP UI, data integration, queues, etc. I chose Qxf2 because it allowed remote working. My interests are vegetable gardening using organic methods, listening to music and reading books.