Extracting data from PDFs using Python

When testing highly data-dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, they usually release it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from a PDF file using Python and write it into a CSV or Excel file.


As an example, we will take the US census data on the Hispanic population for 2010. If you look at the content of the PDF, you can see that it contains a lot of text, tables, graphs, maps, etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from Page 3 of the PDF file.

To achieve this, I first tried PyPDF2 (for extracting the page) and PDFTables (for converting PDF tables to Excel/CSV). This met my requirement, but PDFTables.com is a paid service.

Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.


Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables

When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string. Reading a PDF document is simple and straightforward. I used the PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.

import PyPDF2
 
PDFfilename = "hispanic.pdf" #filename of your PDF/directory where your PDF is stored
 
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

First, I installed the PyPDF2 library, imported it, and created an instance of the PdfFileReader class, which stores information about the PDF (number of pages, text on pages, etc.). In this PDF, the table I need to extract is on Page 3. To extract this page, I used the code below:

pg3 = pfr.getPage(2) #extract page 3 (page indexes start at 0)
writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
 
#add pages
writer.addPage(pg3)
 
#filename of your PDF/directory where you want your new PDF to be
NewPDFfilename = "hispanic_tables.pdf" 
 
with open(NewPDFfilename, "wb") as outputStream: #create new PDF
    writer.write(outputStream) #write pages to new PDF

I called the .getPage() method on the PdfFileReader object with the page number minus 1 as the parameter (page indexes start at 0, so Page 3 is index 2). After that, I created a PdfFileWriter object, added the page to it, and used it to write a new PDF. I wrote the page with the table into a separate PDF file because I used PDFTables for extracting the data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV file, so to avoid having a lot of junk data in the Excel file, I created a separate PDF with just the table that I want to extract.
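The page-number arithmetic above is easy to get wrong, so here is a minimal sketch of it. It assumes a PyPDF2 release older than 3.0, where the PdfFileReader, PdfFileWriter, getPage, getNumPages and addPage names used in this post are available. The helper names to_zero_based and copy_pages are hypothetical and only for illustration.

```python
def to_zero_based(page_numbers, total_pages):
    """Convert 1-based page numbers (as printed in the PDF) to 0-based indexes."""
    indexes = []
    for number in page_numbers:
        if not 1 <= number <= total_pages:
            raise ValueError(f"page {number} is outside 1..{total_pages}")
        indexes.append(number - 1)
    return indexes


def copy_pages(source_path, target_path, page_numbers):
    """Write the given 1-based pages of source_path into a new PDF."""
    # Imported lazily so to_zero_based() can be used without PyPDF2 installed.
    import PyPDF2  # assumption: a PyPDF2 version older than 3.0

    with open(source_path, "rb") as source:
        reader = PyPDF2.PdfFileReader(source)
        writer = PyPDF2.PdfFileWriter()
        for index in to_zero_based(page_numbers, reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        # Write while the source file is still open, since pages are read lazily.
        with open(target_path, "wb") as target:
            writer.write(target)


if __name__ == "__main__":
    # Page 3 of a 10-page document is index 2.
    print(to_zero_based([3], total_pages=10))
```

With PyPDF2 installed, copy_pages("hispanic.pdf", "hispanic_tables.pdf", [3]) would do the same job as the snippet above.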

The PyPDF2 library extracts the text from a PDF document very nicely. The problem is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together).

Writing the Table Data to an Excel File using PDFTables
Now that I have a PDF with just the table data I need, I can use PDFTables to write it to an Excel/CSV file. PDFTables extracts tables from PDF files and converts them to CSV, XML, or XLSX. It provides an API key with which we can post a request to the PDFTables website to get the table extraction. You can get an API key by creating an account on the site for a free trial (PDFTables.com is a paid service, and the free trial covers only a limited number of pages). With this free trial, I was able to upload this PDF and write the response to an Excel file. This served my purpose, but since PDFTables.com is paid, I moved on to exploring other tools for data extraction.
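The post does not show the request itself, so the following is only a rough sketch of what posting a PDF to PDFTables could look like. The endpoint URL, the "key" and "format" query parameters, the "f" upload field and the format names are assumptions based on my recollection of the PDFTables documentation, so please check them against the current docs. The function names are hypothetical, and the third-party requests library is assumed for the upload.

```python
# Assumptions: endpoint, parameter names and format names should be verified
# against the PDFTables documentation before use.
PDFTABLES_URL = "https://pdftables.com/api"
ASSUMED_FORMATS = {"csv", "xml", "xlsx-single"}


def build_request(api_key, output_format="csv"):
    """Return the URL and query parameters for a conversion request."""
    if output_format not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    return PDFTABLES_URL, {"key": api_key, "format": output_format}


def convert_pdf(pdf_path, out_path, api_key, output_format="csv"):
    """Upload pdf_path and save the converted response to out_path."""
    import requests  # third-party; imported lazily

    url, params = build_request(api_key, output_format)
    with open(pdf_path, "rb") as pdf:
        # Assumption: the PDF is sent as a multipart upload in a field named "f".
        response = requests.post(url, params=params, files={"f": pdf})
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)


if __name__ == "__main__":
    print(build_request("YOUR_API_KEY", "csv"))
```

A call such as convert_pdf("hispanic_tables.pdf", "hispanic_tables.csv", "YOUR_API_KEY") would then upload the single-page PDF created earlier.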


Method 2: PDFMiner for extracting text data from PDFs

A great Python-based solution I came across for extracting text from a PDF is PDFMiner. PDFMiner has two command-line scripts, namely pdf2txt.py (to extract text and images) and dumppdf.py (to find objects and their coordinates). I used the pdf2txt.py script to extract the PDF content to HTML format using the command below.

pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf

Below is a list of options that can be used with pdf2txt.py:

  • -o output file name
  • -p comma-separated list of page numbers to extract
  • -t output format (text/html/xml/tag[for Tagged PDFs])
  • -O dirname (triggers extraction of images from PDF into directory)
  • -P password
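If you want to drive pdf2txt.py from a Python script instead of typing the command by hand, one option is to assemble the argument list from the options above. This is a minimal sketch; the function name build_pdf2txt_command is hypothetical, and it only uses the flags listed in this post.

```python
def build_pdf2txt_command(pdf_path, output_path, output_format="html",
                          pages=None, image_dir=None):
    """Assemble a pdf2txt.py argument list from the options listed above."""
    command = ["pdf2txt.py"]
    if image_dir:
        command += ["-O", image_dir]  # directory for extracted images
    command += ["-o", output_path, "-t", output_format]
    if pages:
        # -p takes a comma-separated list of page numbers
        command += ["-p", ",".join(str(page) for page in pages)]
    command.append(pdf_path)
    return command


if __name__ == "__main__":
    command = build_pdf2txt_command(
        "hispanic.pdf", "myoutput/hispanic.html",
        output_format="html", pages=[3], image_dir="myoutput",
    )
    print(" ".join(command))
    # To actually run it (requires PDFMiner to be installed and on the PATH):
    # import subprocess
    # subprocess.run(command, check=True)
```

Passing the arguments as a list rather than one string avoids quoting problems when file names contain spaces.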

The pdf2txt.py command can convert a PDF to HTML or XML. After installing PDFMiner, I changed into the directory where the PDF file is located and ran the pdf2txt.py command shown earlier. The resulting file is ‘hispanic.html’, which contains the 3rd page of the PDF. Data can be read from HTML using Beautiful Soup, a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup to read and extract the data from hispanic.html. You can refer to my previous post on data scraping using Python for extracting table data from HTML and writing it into a CSV file. There I wrote a quick script that extracts table data from a web page using the Wikipedia module and BeautifulSoup.
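To give a feel for the HTML-to-CSV step, here is a minimal sketch. Note two assumptions. First, it uses the standard library's html.parser instead of BeautifulSoup, purely so that the example has no dependencies. Second, it assumes the HTML places each piece of text in an absolutely positioned div whose style carries left and top values in pixels, which may not match your pdf2txt.py output exactly. The sample HTML and its values are made-up placeholders, not census figures.

```python
import csv
import io
import re
from collections import defaultdict
from html.parser import HTMLParser


class PositionedTextParser(HTMLParser):
    """Collect (top, left, text) for every positioned <div> in the HTML."""

    def __init__(self):
        super().__init__()
        self.boxes = []       # finished (top, left, text) tuples
        self._current = None  # [top, left, text parts] of the open <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        style = dict(attrs).get("style") or ""
        top = re.search(r"top:\s*(\d+)px", style)
        left = re.search(r"left:\s*(\d+)px", style)
        if top and left:
            self._current = [int(top.group(1)), int(left.group(1)), []]

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current[2].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            top, left, parts = self._current
            if parts:
                self.boxes.append((top, left, " ".join(parts)))
            self._current = None


def html_to_rows(html_text):
    """Group text boxes sharing a top coordinate into rows, ordered by left."""
    parser = PositionedTextParser()
    parser.feed(html_text)
    rows = defaultdict(list)
    for top, left, text in parser.boxes:
        rows[top].append((left, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample with placeholder values, only to exercise the functions.
SAMPLE_HTML = """
<div style="position:absolute; left:50px; top:100px;"><span>Origin</span></div>
<div style="position:absolute; left:200px; top:100px;"><span>2000</span></div>
<div style="position:absolute; left:300px; top:100px;"><span>2010</span></div>
<div style="position:absolute; left:50px; top:120px;"><span>Row A</span></div>
<div style="position:absolute; left:200px; top:120px;"><span>10</span></div>
<div style="position:absolute; left:300px; top:120px;"><span>20</span></div>
"""

if __name__ == "__main__":
    print(rows_to_csv(html_to_rows(SAMPLE_HTML)))
```

Grouping by exact top coordinate is a simplification; real output may need a small tolerance because cells in the same visual row can differ by a pixel or two.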


In this way, I used PDFMiner and PyPDF2 to extract the data, but you’ll still have to make a choice when deciding which to use and learn. Both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use to get what you need from PDFs, and Python enables you to get inside and scrape, split, merge, delete, and crop just about whatever you find.


In this post, I tried to showcase the different approaches, with a few code snippets, that I used to meet our requirement of extracting table data from a PDF file. I hope you like it!

If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.




58 thoughts on “Extracting data from PDFs using Python”

  1. I’m getting this error while doing

    python pdf2txt.py -o E:\zerodhaVarsity\out.xml -t xml E:\zerodhaVarsity\ZD6176_28112017_BSEMS_1.pdf

    even tried this too

    python pdf2txt.py -o E:\zerodhaVarsity\out.xml -t xml -S E:\zerodhaVarsity\ZD6176_28112017_BSEMS_1.pdf

    can anybody help me with this?

  2. Hi Indira
    Thanks for this blog post. It’s very useful. I have a PDF of 300 pages. Each page has a table which contains a year of attendance for the 50 employees of an office. So in total I have the attendance of 50 different employees at 300 different offices.
    I want to feed the content of these 300 pages to NumPy for further processing.
    I need suggestions to solve this issue.

    1. Hi,

      I think you can try below options:

      1) By using PDFTables, you can convert the pdf file into a .csv file
      2) After conversion, your .csv file should have all the 300 pages data in a single file with column headers in each page.
      3) Based on your requirement, you can write a logic to either split the csv file for each company record(eg., based on company id) and convert each file into DataFrame (or) remove the column headers from csv and create a single DataFrame.
      4) Using Pandas library, you can convert this csv data into dataframe.
      5) Two ways to convert the DataFrame to its NumPy array representation:
      np_array = df.values
      np_array = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0

      I haven’t tried this but you can also try tabula module which extracts pdf into pandas DataFrame.
      https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302

      I hope this helps!

  3. Hi Indira,
    I installed PDFMiner (it seems to be supported only on Python 2) using pip install.
    But I cannot find pdf2txt.py and dumpdf.py installed under PDFMiner.
    So I get a syntax error on running:
    pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf
    Please guide me, I have an urgent requirement.

  4. Hi Indira,

    I am having a PDF attached link below
    http://css4.pub/2017/newsletter/drylab.pdf

    Is it possible to extract each paragraph as one string

    EX:
    Welcome to our first newsletter of 2017! It’s been a while since the last one, and a lot has happened. We promise to keep them coming every two months hereafter, and permit
    ourselves to make this one rather long. The big news is the beginnings of our launch in
    the American market, but there are also interesting updates on sales, development,
    mentors and (of course) the investment round that closed in January.

    1. Hi,
      I am not really sure about this. Using Adobe’s reader/writer API (paid) may be one way to go about it. If I find other approaches, I will let you know.

      Thanks & Regards
      Avinash Shetty

  5. Hi Indira,
    I want to convert a PDF file that contains nested lists and tables into XML.
    How can this be achieved?

  6. Hi Indira/Team,
    I am working on a project that requires extracting a 12-character alphanumeric code whose first two characters are always letters (e.g. IN0001102AB0) from PDF documents. I have tried almost every approach I could find on the internet, and have finally reached this blog post of yours. Could you please help out if you can? The number of pages can range from 0 to thousands. Any help or guidance will be highly appreciated.

    1. Hi Shashi,
      Hope you are doing great!
      We have not had a chance to try pattern search within a PDF file. I believe it is easier to first do a straightforward text extract from the PDF file into a text file using PDFMiner. You can then use a regular expression module with a search pattern that fits your requirement to fetch all matches from the text file. Here is a StackOverflow solution as well: https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python. Disregard it if you have already tried this.
      Thank you,
      Qxf2 Team
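For readers with the same question, here is a minimal sketch of the regular-expression step suggested in the reply above. It assumes the text has already been extracted from the PDF (for example with pdf2txt.py), that the codes use uppercase letters and digits only, and that each code stands alone as a word. The pattern and the function name find_codes are hypothetical and should be adjusted to the real data.

```python
import re

# Two uppercase letters followed by ten uppercase letters or digits,
# e.g. IN0001102AB0 (12 characters in total).
CODE_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{10}\b")


def find_codes(text):
    """Return the distinct codes found in the extracted text, sorted."""
    return sorted(set(CODE_PATTERN.findall(text)))


if __name__ == "__main__":
    sample = "Ref IN0001102AB0 issued; see IN0001102AB0 and 123456789012."
    print(find_codes(sample))
```

The word boundaries keep the pattern from matching inside longer strings, and the purely numeric 12-digit value in the sample is skipped because it does not start with two letters.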

  7. Hi, I am very new to Python and programming. I need to extract table data from a PDF and convert it to XML. I found your code here; however, when using it, I get an error and I do not know how to fix it. Could you please help me?
    I have pip installed PDFMiner and wrote this code of yours in the command line:
    pdf2txt.py -O /Users/Mymacpro/PycharmProjects/pdftoxml -o /Users/Mymacpro/PycharmProjects/pdftoxml/testit.xml -t xml -p 8 test.pdf
    Here is the error I get:
    File “/Users/Mymacpro/PycharmProjects/pdftoxml/tablepdf.py”, line 17
    pdf2txt.py -O /Users/Mymacpro/PycharmProjects/pdftoxml -o /Users/Mymacpro/PycharmProjects/pdftoxml/testit.xml -t xml -p 8 test.pdf
    ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

    What can the problem be?
    Thank you in advance.

    1. Hi,
      I would like to confirm that the PDFMiner installation went fine. Can you pls,
      1. try pdf2txt.py --help
      2. If step 1 worked fine,try the pdf2txt.py -o testit.xml -t xml -p 8 test.pdf command in location /Users/Mymacpro/PycharmProjects/pdftoxml/. Pls have the pdf file in that location.
      3. If step 1 failed then, run pip uninstall pdfminer and follow the steps in https://euske.github.io/pdfminer/#changes to install it again.
      PS PDFMiner is not compatible with Python 3.X version. Pls make sure you are running a 2.x version. You can check it using python --version command

  8. Hi,
    I have a PDF file from which I want to extract particular pages, namely those with the text “test your knowledge” at the end of each chapter. How do I do it?
