Extracting data from PDFs using Python

When testing highly data-dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, they usually release it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from a PDF file using Python and write it to a CSV or Excel file.


As an example, we will use US census data on the Hispanic population for 2010. If you look at the content of the PDF, you can see that it contains a lot of text, tables, graphs, maps, etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from page 3 of the PDF file.

To achieve this, I first tried using PyPDF2 (for extracting the page) and PDFTables (for converting PDF tables to Excel/CSV). This combination met my requirement, but PDFTables.com is a paid service.

Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.


Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables

When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string. Reading a PDF document is simple and straightforward. I used the PdfFileReader() and PdfFileWriter() classes for reading the PDF and writing out the page that contains the table (newer PyPDF2 releases rename these classes to PdfReader and PdfWriter).

import PyPDF2
 
PDFfilename = "hispanic.pdf" # path to your PDF file
 
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

First, I installed the PyPDF2 library and imported it, then created an instance of the PdfFileReader class, which stores information about the PDF (number of pages, text on pages, etc.). In this PDF, the table I need to extract is on page 3. To extract this page, I used the code below:

pg3 = pfr.getPage(2) #extract page 3 (pages are zero-indexed, so page 3 is index 2)
writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
 
#add pages
writer.addPage(pg3)
 
#path where the new PDF will be written
NewPDFfilename = "hispanic_tables.pdf"
 
with open(NewPDFfilename, "wb") as outputStream: #create new PDF
    writer.write(outputStream) #write pages to new PDF

I called the .getPage() method on the PdfFileReader object, passing the page number minus 1 as the parameter, because pages are indexed from 0 (page 3 is index 2). After that, I created a PdfFileWriter object, added the page to it, and wrote it out as a new PDF. I wrote the page with the table into a separate PDF file because PDFTables puts everything in the PDF document (not just tables) into the output Excel or CSV file. To avoid a lot of junk data in the Excel file, I created a separate PDF containing only the page with the table I wanted to extract.
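The gap between printed page numbers and PyPDF2's zero-based indices is easy to get wrong, so here is a minimal sketch of a helper that wraps the steps above. It assumes an older PyPDF2 release (before 3.0) that still provides PdfFileReader, PdfFileWriter, getNumPages(), getPage() and addPage(); the function names to_page_indices and copy_pages are my own and not part of PyPDF2.

```python
def to_page_indices(page_numbers, total_pages):
    """Convert 1-based page numbers (as printed in the PDF) to 0-based indices."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= total_pages:
            raise ValueError(f"page {number} is outside 1..{total_pages}")
        indices.append(number - 1)  # page 3 -> index 2
    return indices


def copy_pages(source_path, target_path, page_numbers):
    """Write the selected pages of source_path into a new PDF at target_path."""
    # Imported here so to_page_indices can be used without PyPDF2 installed.
    # Assumes PyPDF2 < 3.0 (the API used in this post).
    import PyPDF2

    with open(source_path, "rb") as source:
        reader = PyPDF2.PdfFileReader(source)
        writer = PyPDF2.PdfFileWriter()
        for index in to_page_indices(page_numbers, reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        # Write while the source file is still open, since pages are read lazily.
        with open(target_path, "wb") as target:
            writer.write(target)
```

With this in place, copy_pages("hispanic.pdf", "hispanic_tables.pdf", [3]) should produce the same single-page PDF as the code above.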

The PyPDF2 library extracts the text from a PDF document very nicely. The problem is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together).
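To see this in-line extraction for yourself, a short sketch along these lines returns the raw text of a single page. It assumes an older PyPDF2 release (before 3.0), where page objects expose extractText(); the function name page_text is hypothetical.

```python
def page_text(pdf_path, page_number):
    """Return the text of one page (1-based page_number) as a plain string."""
    if page_number < 1:
        raise ValueError("page_number is 1-based and must be at least 1")
    # Imported lazily; assumes PyPDF2 < 3.0 (the API used in this post).
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        # getPage() is zero-indexed, so page 3 lives at index 2
        return reader.getPage(page_number - 1).extractText()


# Example call (needs PyPDF2 and the PDF on disk):
# print(page_text("hispanic.pdf", 3))
```

The returned string carries no row or column structure, which is why the table cells cannot be recovered reliably from it.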

Writing the Table Data to an Excel File using PDFTables
Now that I have a PDF with only the table data that I need, I can use PDFTables to write the table data to an Excel/CSV file. The PDFTables package extracts tables from PDF files and converts them to other formats (CSV, XML, or XLSX). The service provides an API key that we can use to post a request to the PDFTables website and get the extracted tables back. You can get an API key by creating an account on the site for a free trial (PDFTables.com is a paid service, and the free trial covers only a limited number of pages). With this free trial, I was able to upload this PDF and write the response to an Excel file. This served my purpose, but since PDFTables.com is paid, I moved on to exploring other tools for data extraction.
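As a rough sketch of what the upload step can look like, the code below uses the pdftables_api Python client. I am assuming here that the client exposes a Client class with csv(), xml() and xlsx() methods that take an input path and an output path, so check the PDFTables documentation before relying on it. The helper names output_path and convert_tables are hypothetical.

```python
import os

SUPPORTED_FORMATS = ("csv", "xml", "xlsx")


def output_path(pdf_path, fmt):
    """Derive the output file name, e.g. hispanic_tables.pdf -> hispanic_tables.xlsx."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    base, _ = os.path.splitext(pdf_path)
    return f"{base}.{fmt}"


def convert_tables(pdf_path, api_key, fmt="xlsx"):
    """Upload pdf_path to PDFTables and save the converted file next to it."""
    # Third-party client, imported lazily so output_path works without it.
    import pdftables_api

    target = output_path(pdf_path, fmt)
    client = pdftables_api.Client(api_key)
    # Assumption: the client has csv(), xml() and xlsx() methods
    # that each take (input_path, output_path).
    getattr(client, fmt)(pdf_path, target)
    return target
```

A call such as convert_tables("hispanic_tables.pdf", "my-api-key") would then write hispanic_tables.xlsx, with "my-api-key" replaced by the key from your own account.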


Method 2: PDFMiner for extracting text data from PDFs

PDFMiner is a great Python-based solution for extracting text from a PDF. It comes with two command-line scripts: pdf2txt.py (to extract text and images) and dumppdf.py (to inspect the objects inside a PDF). I used the pdf2txt.py script to convert the PDF content to HTML format using the command below.

pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf

Below is a list of options that can be used with pdf2txt.py:

  • -o output file name
  • -p comma-separated list of page numbers to extract
  • -t output format (text/html/xml/tag[for Tagged PDFs])
  • -O dirname (triggers extraction of images from PDF into directory)
  • -P password
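If you want to drive pdf2txt.py from a Python script rather than typing the command by hand, the options listed above can be assembled into an argument list. This is only a sketch: it assumes pdf2txt.py is on your PATH, and the function names build_pdf2txt_command and run_pdf2txt are hypothetical.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path, pages,
                          output_format="html", image_dir=None):
    """Build the pdf2txt.py argument list for the given 1-based page numbers."""
    command = ["pdf2txt.py", "-o", output_path, "-t", output_format]
    if pages:
        # -p takes a comma-separated list of page numbers
        command += ["-p", ",".join(str(page) for page in pages)]
    if image_dir:
        # -O triggers extraction of images into the given directory
        command += ["-O", image_dir]
    command.append(pdf_path)
    return command


def run_pdf2txt(*args, **kwargs):
    """Run pdf2txt.py; raises CalledProcessError if the script exits with an error."""
    subprocess.run(build_pdf2txt_command(*args, **kwargs), check=True)
```

For the example in this post, run_pdf2txt("hispanic.pdf", "myoutput/hispanic.html", [3], image_dir="myoutput") passes the same options as the command shown earlier.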

The pdf2txt.py command can be used to convert a PDF to HTML or XML. After installing PDFMiner, I changed into the directory where the PDF file is located and ran the command. The resulting file is ‘hispanic.html’, which contains the 3rd page of the PDF. Reading data from HTML can be done using Beautiful Soup, a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup for reading and extracting the data from hispanic.html. You can refer to my previous post on data scraping using Python for extracting table data from HTML and writing it into a CSV file. In that post, I wrote a quick script that extracts table data from a web page using the Wikipedia module and BeautifulSoup.
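To give a feel for the HTML-to-CSV step, here is a dependency-free sketch. Note that it swaps BeautifulSoup for the standard library's html.parser so it can run without extra installs. It also assumes that the HTML places each piece of text in a div whose style attribute carries left and top pixel positions, and that cells in the same table row share exactly the same top value. Real pdf2txt.py output may group several lines into one box or offset cells by a pixel or two, so treat this as a starting point rather than a finished extractor. All names below are hypothetical.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Matches e.g. "left:72px" or "top:121px" inside a style attribute
POSITION = re.compile(r"\b(left|top):\s*(-?\d+)px")


class TextBoxParser(HTMLParser):
    """Collect (top, left, text) for every <div> that has both left and top set."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self._position = None  # (top, left) of the <div> being read, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            found = dict(POSITION.findall(dict(attrs).get("style") or ""))
            if "left" in found and "top" in found:
                self._position = (int(found["top"]), int(found["left"]))
                self._chunks = []

    def handle_data(self, data):
        if self._position is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._position is not None:
            text = " ".join("".join(self._chunks).split())  # collapse whitespace
            if text:
                self.boxes.append((*self._position, text))
            self._position = None


def html_to_rows(html):
    """Group text boxes sharing a top coordinate into rows, ordered left to right."""
    parser = TextBoxParser()
    parser.feed(html)
    rows = {}
    for top, left, text in parser.boxes:
        rows.setdefault(top, []).append((left, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]


def rows_to_csv(rows):
    """Serialise the rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

In practice you would read myoutput/hispanic.html into a string, pass it to html_to_rows, and write the result of rows_to_csv to a file. The same grouping idea carries over to BeautifulSoup if you prefer to stay with that library.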


In this way, I used both PDFMiner and PyPDF2 to extract the data, but you will still have to decide which one to learn and use. At the time of writing, both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use alongside them to get what you need, and Python enables you to get inside a PDF and scrape, split, merge, delete, and crop just about whatever you find.


In this post, I tried to showcase the different approaches I used to meet our requirement of extracting table data from a PDF file, along with a few code snippets. I hope you will like it!



References:

1) Manipulating PDFs with python and PyPDF2

2) Working with pdf file in python

3) Different PDF tools to extract text and data from pdfs


56 thoughts on “Extracting data from PDFs using Python”

  1. I have a PDF where it contains many contracts . I need a way to find all the contracts and list them out . Can any one suggest the way how it can be achieved.

    1. Hi,
      As mentioned in the blog, we suggest using PDFMiner to convert the contract pages in the source PDF into HTML. After that, write a Python script using the Beautiful Soup library to process the data in the HTML (i.e., in your case, to list out the contracts).

  2. I want to extract the text from pdf that contains table into the csv file what are the possible ways to extract different table formats data to csv file .

  3. Hello, I have a list of 400 PDF files which contain text and images. I want to extract only the text and the ID of each PDF file, which is mentioned in the PDF, and I want the output in Excel form showing only two columns: one for the ID and a second for the description, which is the whole text of the PDF file. If you could help me find code in Python, it would be a great help for me.
    thank you .

  4. Hi, Can someone help me to satisfy below requirement:
    Wanna to extract individual table values from PDF. Later the values(contains both string & integer) can be compared with some conditions, if the condition satisfied the pdf files will be moved to another folder else the pdf should be discarded.

    1. Hi Madhan,
      The PDF content can be extracted in two ways, either using PyPDF2 and PDFTables or using PDFMiner, both of which are explained in this blog. After extracting the content, you can complete your requirement by comparing the values and moving the files.
      Hope this helps.

  5. Hi,
    when i ran the code i am getting below error. can you please help me out here ?
    if len(filters) == 1 and filters[0][0] in LITERALS_DCT_DECODE:
    TypeError: object of type ‘zip’ has no len()

  6. I have form 16 or form 16a.
    I want to extract numbers from those pages when uploaded on web and save them in database

      1. This blog shows the ways to extract data from pdf. Kindly try with numbers too. You can refer to the references link regarding tools to extract provided at the bottom of the blog.

    1. Hi, the method listed in the blog is applicable to any generic data. Kindly try it with your PDF file.
