Extracting data from PDFs using Python

When testing highly data dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, it usually releases it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from PDF file using Python and write it into a CSV or Excel file.


We will take an example of US census data for the Hispanic Population for 2010. If you look at the content of the PDF, you can see that there is a lot of text data, table data, graphs, maps etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from Page 3 of the PDF file.

For achieving this, I first tried using PyPDF2 (for extracting) and PDFtables (for converting PDF tables to Excel/CSV). It did serve my requirement but PDFtables.com is paid service.

Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.


Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables

When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents. But it can extract text and return it as a Python string. Reading a PDF document is pretty simple and straight forward. I used PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.

import PyPDF2
 
PDFfilename = "hispanic.pdf" #filename of your PDF/directory where your PDF is stored
 
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

Firstly, I installed PyPDF2 library and imported it, created an instance of the PdfFileReader Class, which stores information about the PDF (number of pages, text on pages, etc). In this PDF, the table which I need extract is in Page 3. To extract this page, I used below code:-

pg3 = pfr.getPage(2) #extract pg 2
writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
 
#add pages
writer.addPage(pg3)
 
#filename of your PDF/directory where you want your new PDF to be
NewPDFfilename = "hispanic_tables.pdf" 
 
with open(NewPDFfilename, "wb") as outputStream: #create new PDF
    writer.write(outputStream) #write pages to new PDF

I used the .getPage() method, with the page number + 1 as the parameter (pages start at 0), on PdfFileReader object. After that, I created a PdfFileWriter object, which will eventually write a new PDF and add the pages to it. The purpose of writing this page with tables into separate pdf file is that I used PDFTables for extracting data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV, to avoid having a lot of junk data in the Excel I created a separate PDF with just the table that I want to extract.

PyPDF2 library extracts the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together)

Writing the Table Data to a Excel using PDFTables
Now that I have a PDF with all of the table data that I need, I can now use PDFTables to write the table data to an Excel/CSV file. The PDFTables package extracts tables from PDF files and allows the user to convert PDF tables to formats (CSV, XLM, or XLSX). It provides us with an API key using which we can post a request to the PDFTables website to get the table extraction. You can get an API key by creating an account on the site for a free trial (PDFtables.com is paid, getting an API Key is restricted to certain pages only). With this free trial, I was able to upload this pdf and write the response to an excel. This served my purpose, but since PDFTables.com is paid I moved on exploring other tools for data extraction.


Method 2: PDFMiner for extracting text data from PDFs

I came across a great Python-based solution to extract the text from a PDF is PDFMiner. PDFMiner has two command-line scripts namely pdf2txt.py (to extract text and images) and dumpdf.py (find objects and their coordinates). I used pdf2txt.py script to extract the pdf content to HTML format using below command.

pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf

Below is list of options which can be used with pdf2txt.py
Options:

  • -o output file name
  • -p comma-separated list of page numbers to extract
  • -t output format (text/html/xml/tag[for Tagged PDFs])
  • -O dirname (triggers extraction of images from PDF into directory)
  • -P password

The above command can be used to convert a PDF to HTML or XML. After installing PDFMiner, cd into the directory where the PDF file is located and ran the above command. The resulting file will be ‘hispanic.html’ which has the 3rd page from the PDF. Reading data from HTML can be done using Beautiful Soup. It is a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup for reading and extracting the data from hispanic.html. You can refer to my previous post on Data scraping using python for extracting table data from html and writing into a csv file. I wrote a quick script that will extract table data from web page using Wikipedia module and BeautifulSoup.


In this way, I used PDFMiner and PyPDF2 to extract the data, but you’ll still have to make a choice when deciding which to use and learn. Both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use to get what you need from them, and Python enables to get inside and scrape, split, merge, delete, and crop just about whatever you find.


In this post, I tried to showcase different approaches with few code snippets which I implemented in our requirement for extracting table data from PDF file by providing. I hope you will like it!

If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.


References:-

1) Manipulating PDFs with python and PyPDF2

2) Working with pdf file in python

3) Different PDF tools to extract text and data from pdfs


56 thoughts on “Extracting data from PDFs using Python

  1. Hi Indira/Team,

    I want a extract each word from a pdf with it’s coordinates. Currently I am using LTChar which gives coordinates of each character. I am also able to get the coordinates of the whole line using LTTextLine. But I need to for the words. Is this doable using PDFMiner?

    Please guide.
    Thanks in advance.,
    Shikha Thakur

    1. Hi Shikha,

      We are not sure how to get each word with its coordinates. One idea which we had though is since you are using LTChar to get characters, you can simply put the characters in a list, join them all and then split based on whitespace?

      Thanks & Regards
      Avinash Shetty

  2. PYPDF2 gives text only searchable pdf files. I have lot of pdf’s those are non searchable pdf’s. how can i extract text from that pdf’s

  3. Hi,
    I have a pdf for payroll of employees, it contains data as shown below:

    FOLHA DE PAGAMENTO ANALÍTICA
    Empresa : GOOD JOB SEG E VIG PATRIMONIAL LTDA ( 00007 ) Página : 00001
    End. : ALAMEDA DOS PIRATINIS, 704 CNPJ/CEI: 10.336.666/0001-79
    Ref.: 01/05/2018 a 31/05/2018 Dpto : CONDOMINIO EDIFICIO INTERNATIONAL TRADE CENTER RES
    OS VALORES DE FÉRIAS E RESCISÃO JÁ FORAM PAGOS.
    Código Nome Ref. Sal. Contratual Adicionais Descontos Líquido Recibo
    010017 DULCINEA SILVA TEIXEIRA NUNES 1.487,00 Função : VIGILANTE Livro: 0000 Folha. : 000
    Admissão : 21/09/2011 Dep IR : 0 Dep SF : 0
    001 SALÁRIO BASE 220:00 1.487,00
    103 HORA EXTRA 60% 001:00 10,81
    035 AD. NOTURNO 20% 112:00 151,40
    020 AD. PERICULOSIDADE 446,10
    420 REPOUSO REMUNERADO- DSR 38,93
    999 ARREDOND. DO MES 0,60
    654 DESC. CONVENIO MEDICO 5% 74,35
    644 DESC. 6% VALE TRANSP. 89,22
    661 DESC. 18% VALE REF. 40,72
    610 ARREDOND. DO M/A 0,60
    617 CONTR. ASSISTENCIAL 1% 14,87
    903 INSS FOLHA 192,08
    Resumo do Líquido 2.134,84 411,84 *******1*.*7*2*3*,*0**0
    (+) Folha Analítica ………………….: 1.723,00
    (-) Adiantamento……………………..: 0,00
    (-) Férias ……………………………….: 0,00
    (-) Rescisão …………………………..: 0,00
    (-) 13o Salário ………………………..: 0,00
    ____/____/______
    (=) Total Líqüido ……………………..: ********************1*.*7*2**3*,*0*0*
    Base INSS 2.134,24 Base FGTS 2.134,24 FGTS 170,73 Base IRRF 2.134,24

    this is not in tabular format but I need to extract this as table.
    Could anyone please help me…

    1. Hi,
      I don’t think we can extract/convert non tabular data of PDF in the format of table data directly.You can try text extract from PDF and save it to CSV , manually format the CSV data and create data table from CSV data using respective some module.

  4. Hi,

    I have a requirement that I have multiple PDF files. I want to search all PDF files based on some keyword and return the name of PDF file in which that keyword exists.

  5. Converting PDF to excel
    Hello all

    I have PDF files all are in tables content, and on each page there is a page header, I want to extract all data into excel and trying to keep the PDF format. I used PyPDF2.py but its only convert into text.

    any ideas.

  6. It looks a very nice code and I know that it will work, but I stopped to apply it because the pdf that I want to read it is with password, how would you read that?

    1. Hi,
      Did you try pdf2txt.py -O dirname -o output-file-name new-filename -t html -p page-no -P password pdf-name ?

  7. hi,

    I have a pdf report,i need to programatically extract the content into another pdf containing the heading specified by user.

    Is there a way this could be done?

  8. Hi Indira,
    I have to extract to Images from PDF files and PDF files are made up of only image files(TIFF).
    Can you suggest open source tools for that? I tried using PDF2image but it didn’t work.

  9. Hi Indira,
    I have invoices in pdf format for the company I work in and need certain data from those pdf file. Can you suggest where I can start from. I am able to extract the data from pdf to dataframe but I am not able to collect the specific data I want as it contains both a table and non tabular data.

    1. Hi,

      Did you try PDFTables for extracting the tables from the pdf? For extracting non tabular data you can use PDFMiner. My suggestion is that based on your requirement, you may use both these modules and extract table and non tabular data.

Leave a Reply

Your email address will not be published.