When testing highly data-dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, they usually release it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from a PDF file using Python and write it into a CSV or Excel file.
We will take the example of US census data for the Hispanic Population for 2010. If you look at the content of the PDF, you can see that there is a lot of text data, table data, graphs, maps, etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from page 3 of the PDF file.
To achieve this, I first tried using PyPDF2 (for extracting) and PDFTables (for converting PDF tables to Excel/CSV). It did serve my requirement, but PDFTables.com is a paid service.
Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.
Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables
When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string. Reading a PDF document is pretty simple and straightforward. I used the PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.
import PyPDF2

PDFfilename = "hispanic.pdf"  # filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object
Firstly, I installed the PyPDF2 library, imported it, and created an instance of the PdfFileReader class, which stores information about the PDF (number of pages, text on pages, etc.). In this PDF, the table I need to extract is on page 3. To extract this page, I used the code below:
pg3 = pfr.getPage(2)  # extract page 3 (pages are zero-indexed)
writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
writer.addPage(pg3)  # add the page

# filename of your PDF/directory where you want your new PDF to be
NewPDFfilename = "hispanic_tables.pdf"
with open(NewPDFfilename, "wb") as outputStream:  # create new PDF
    writer.write(outputStream)  # write pages to new PDF
I used the .getPage() method on the PdfFileReader object, with the page number minus 1 as the parameter (pages start at 0). After that, I created a PdfFileWriter object, which will eventually write a new PDF, and added the page to it. The reason for writing this page with the table into a separate PDF file is that I used PDFTables for extracting the data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV, so to avoid having a lot of junk data in the Excel file, I created a separate PDF with just the table that I want to extract.
The PyPDF2 library extracts the text from a PDF document very nicely. The problem is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together).
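For reference, here is a minimal sketch of that plain-text extraction using the (older) PyPDF2 API shown in this post, on the same census PDF:
```
import PyPDF2

# Open the census PDF and pull the raw text of page 3 (index 2, since pages are zero-indexed)
with open("hispanic.pdf", "rb") as f:
    reader = PyPDF2.PdfFileReader(f)
    page_text = reader.getPage(2).extractText()

print(page_text)  # the table's numbers come out mashed in with the surrounding prose
```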
Writing the Table Data to an Excel File using PDFTables
Now that I have a PDF with all of the table data that I need, I can use PDFTables to write the table data to an Excel/CSV file. The PDFTables package extracts tables from PDF files and allows the user to convert PDF tables to other formats (CSV, XML, or XLSX). It provides an API key with which we can post a request to the PDFTables website to get the table extracted. You can get an API key by creating an account on the site for a free trial (PDFTables.com is a paid service; the free trial is limited to a certain number of pages). With this free trial, I was able to upload this PDF and write the response to an Excel file. This served my purpose, but since PDFTables.com is paid, I moved on to exploring other tools for data extraction.
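To give an idea of what that request looks like, here is a rough sketch using the requests library to post the one-page PDF to the PDFTables API; treat the endpoint, the format parameter, and of course the API key as assumptions to verify against your PDFTables account page:
```
import requests

API_KEY = "your-pdftables-api-key"  # assumption: taken from your PDFTables account

# Post the single-page PDF and ask for an Excel file back (endpoint and format are assumed)
with open("hispanic_tables.pdf", "rb") as f:
    response = requests.post(
        "https://pdftables.com/api",
        params={"key": API_KEY, "format": "xlsx-single"},
        files={"f": ("hispanic_tables.pdf", f)},
    )

response.raise_for_status()
with open("hispanic_tables.xlsx", "wb") as out:
    out.write(response.content)
```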
Method 2: PDFMiner for extracting text data from PDFs
Another great Python-based solution for extracting text from PDFs is PDFMiner. PDFMiner has two command-line scripts, namely pdf2txt.py (to extract text and images) and dumppdf.py (to find objects and their coordinates). I used the pdf2txt.py script to extract the PDF content to HTML format using the command below.
pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf
Below is a list of options which can be used with pdf2txt.py:
Options:
- -o output file name
- -p comma-separated list of page numbers to extract
- -t output format (text/html/xml/tag[for Tagged PDFs])
- -O dirname (triggers extraction of images from PDF into directory)
- -P password
The above command can be used to convert a PDF to HTML or XML. After installing PDFMiner, I cd'd into the directory where the PDF file is located and ran the above command. The resulting file is ‘hispanic.html’, which has the 3rd page of the PDF. Reading data from HTML can be done using Beautiful Soup, a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup for reading and extracting the data from hispanic.html, as sketched below. You can refer to my previous post on data scraping using Python for extracting table data from HTML and writing it into a CSV file; there I wrote a quick script that extracts table data from a web page using the Wikipedia module and BeautifulSoup.
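As a rough illustration of that step (note that pdf2txt.py emits positioned <div>/<span> blocks rather than a real <table>, so the selection and row-splitting logic below is a placeholder you would adapt to your own table's layout):
```
import csv
from bs4 import BeautifulSoup

# Parse the HTML that pdf2txt.py produced for page 3
with open("myoutput/hispanic.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Collect the text of each positioned block; refine this for your table's layout
blocks = [div.get_text(strip=True) for div in soup.find_all("div") if div.get_text(strip=True)]

# Write one block per row into a CSV file
with open("hispanic.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for block in blocks:
        writer.writerow([block])
```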
In this way, I used PDFMiner and PyPDF2 to extract the data, but you’ll still have to make a choice when deciding which to use and learn. Both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use to get what you need from them, and Python enables you to get inside and scrape, split, merge, delete, and crop just about whatever you find.
In this post, I tried to showcase, with a few code snippets, the different approaches I implemented for our requirement of extracting table data from a PDF file. I hope you find it useful!
If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.
References:
1) Manipulating PDFs with python and PyPDF2
2) Working with pdf file in python
3) Different PDF tools to extract text and data from pdfs
I am an experienced engineer who has worked with top IT firms in India, gaining valuable expertise in software development and testing. My journey in QA began at Dell, where I focused on the manufacturing domain. This experience provided me with a strong foundation in quality assurance practices and processes.
I joined Qxf2 in 2016, where I continued to refine my skills, enhancing my proficiency in Python. I also expanded my skill set to include JavaScript, gaining hands-on experience and even building frameworks from scratch using TestCafe. Throughout my journey at Qxf2, I have had the opportunity to work on diverse technologies and platforms, which includes working with powerful data validation frameworks like Great Expectations and AI tools like Whisper AI, and developing expertise in various web scraping techniques. I recently started exploring Rust. I enjoy working with a variety of tools and sharing my experiences through blogging.
My interests are vegetable gardening using organic methods, listening to music and reading books.
I’m getting this error while doing
python pdf2txt.py -o E:\zerodhaVarsity\out.xml -t xml E:\zerodhaVarsity\ZD6176_28112017_BSEMS_1.pdf
even tried this too
python pdf2txt.py -o E:\zerodhaVarsity\out.xml -t xml -S E:\zerodhaVarsity\ZD6176_28112017_BSEMS_1.pdf
can anybody help me with this?
Hi,
Could you please provide the error details?
Where is the pdf2txt.py file?
Hi,
pdf2txt.py is a command-line tool that is part of PDFMiner. It is here – https://github.com/euske/pdfminer/blob/master/tools/pdf2txt.py
It is bundled within the pdfminer package, and pdfminer adds it to the system path, making it available to run from the command line.
It is working.
Hi Indira
Thanks for this blog post. It’s very useful. Indira, I have a PDF of 300 pages. Each page has a table which contains the attendance of 50 employees for a year at an office. So in total I have the attendance of 50 different employees at 300 different offices.
I want to feed the content of these 300 pages to NumPy for further processing.
Need a suggestion to solve this issue.
Hi,
I think you can try the below options:
1) By using PDFTables, you can convert the PDF file into a .csv file.
2) After conversion, your .csv file should have all 300 pages' data in a single file, with the column headers repeated from each page.
3) Based on your requirement, you can write logic to either split the CSV file for each company record (e.g., based on company id) and convert each file into a DataFrame, or remove the repeated column headers from the CSV and create a single DataFrame.
4) Using the Pandas library, you can convert this CSV data into a DataFrame.
5) There are two ways to convert the DataFrame to its NumPy array representation (see the sketch below):
np_array = df.values
np_array = df.as_matrix(columns=None)
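A minimal, untested sketch of steps 4 and 5, assuming the converted file is called attendance.csv and has an office_id column (both names are placeholders):
```
import pandas as pd

# Load the converted attendance CSV into a DataFrame
df = pd.read_csv("attendance.csv")  # placeholder filename

# Optionally split per office before converting (column name is a placeholder)
for office_id, office_df in df.groupby("office_id"):
    np_array = office_df.values  # NumPy representation of one office's table
    print(office_id, np_array.shape)
```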
I haven’t tried this, but you can also try the tabula module, which extracts PDF tables into a pandas DataFrame.
https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302
I hope this helps!
Hi Indira,
Installed PDFMiner(seems it’s supported with only Python 2) using pip install.
But cannot find pdf2txt.py & dumpdf.py installed under PDFMiner.
So getting Syntax error on running-
pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf
Pls guide, I have an urgent requirement.
Hi Shikha,
What is the error you are getting?
You should find pdf2txt.py & dumppdf.py inside your $\Python27\Scripts folder.
You probably need to run the command using python, e.g.: python pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf
or make sure $\Python27\Scripts is added to your system environment path.
Ref: https://stackoverflow.com/questions/31574629/pdf2txt-py-not-executing-command
Hi Indira,
I have a PDF, linked below:
http://css4.pub/2017/newsletter/drylab.pdf
Is it possible to extract each paragraph as one string?
EX:
“`
Welcome to our first newsletter of 2017! It’s been a while since the last one, and a lot has happened. We promise to keep them coming every two months hereafter, and permit
ourselves to make this one rather long. The big news is the beginnings of our launch in
the American market, but there are also interesting updates on sales, development,
mentors and (of course) the investment round that closed in January.
“`
Hi,
I am not really sure about this. Using Adobe’s reader/writer API (paid) may be one way to go about it. If I come across other approaches, I will let you know shortly.
Thanks & Regards
Avinash Shetty
Use Tesseract OCR
Hi Indira,
I want to convert a PDF file that contains nested lists and tables into XML.
How can it be achieved?
Hi Pranav,
If either the PDFMiner or the PDFTables module didn’t work in your case, you can try the pypdf2xml module, though we haven’t tried that module yet. You can also refer to the following link for more PDF extraction tools: http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
Thanks,
Rohan Dudam
Hi Indira/Team,
I am working on a project that requires extracting a 12-digit alphanumeric code whose first two digits are always alphabets (e.g. IN0001102AB0) from PDF documents. I have tried mostly all the ways I could find over the internet. Finally I’ve reached this blog post of yours. Could you please help out if you can? The number of pages can range from 0 to thousands. Any help or guidance will be highly appreciated.
Hi Shashi,
Hope you are doing great!
We did not have a chance to try pattern search within a PDF file. I believe it is easier to do a straightforward text extract from the PDF file into a text file using PDFMiner. Then you can use the regular expression module with a search pattern matching your requirement to fetch all matches from the text file. Here is another StackOverflow solution: https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python; disregard it if you have already tried this.
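For example, a rough sketch of that approach, assuming the PDF text has already been dumped to output.txt with pdf2txt.py (the file name is a placeholder, and the pattern simply matches two letters followed by ten alphanumeric characters, so adjust it to your exact code format):
```
import re

# Read the text extracted from the PDF (e.g. via: pdf2txt.py -o output.txt input.pdf)
with open("output.txt") as f:
    text = f.read()

# Two leading letters followed by ten alphanumeric characters, e.g. IN0001102AB0
codes = re.findall(r"\b[A-Z]{2}[A-Z0-9]{10}\b", text)
print(codes)
```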
Thank you,
Qxf2 Team
Hi, I am very new to Python and programming. I need to extract table data from a PDF and convert it to XML. Here I found your code; however, when using it I get an error and I do not know how to fix it. Could you please help me?
I have pip installed PDFMiner and wrote this code of yours in command line:
pdf2txt.py -O /Users/Mymacpro/PycharmProjects/pdftoxml -o /Users/Mymacpro/PycharmProjects/pdftoxml/testit.xml -t xml -p 8 test.pdf
Here is the error I get:
File “/Users/Mymacpro/PycharmProjects/pdftoxml/tablepdf.py”, line 17
pdf2txt.py -O /Users/Mymacpro/PycharmProjects/pdftoxml -o /Users/Mymacpro/PycharmProjects/pdftoxml/testit.xml -t xml -p 8 test.pdf
^
SyntaxError: invalid syntax
Process finished with exit code 1
What can the problem be?
Thank you in advance.
Hi,
I would want to confirm if the PDFMiner installation went fine. Can you please:
1. Try pdf2txt.py --help
2. If step 1 worked fine, try the pdf2txt.py -o testit.xml -t xml -p 8 test.pdf command in the location /Users/Mymacpro/PycharmProjects/pdftoxml/. Please have the PDF file in that location.
3. If step 1 failed, then run pip uninstall pdfminer and follow the steps in https://euske.github.io/pdfminer/#changes to install it again.
PS: PDFMiner is not compatible with Python 3.x. Please make sure you are running a 2.x version. You can check this using the python --version command.
How do I anonymise a certain column in a PDF?
Hi,
Sorry, we may not be able to help you with this. We have not tried editing PDFs before.
Hi,
I have a PDF file from which I am looking to extract the particular pages which have the text "test your knowledge" at the end of each chapter. How do I do it?
Hi,
Possible solutions to get the page number based on text search:
1. https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python (Simple)
2. https://stackoverflow.com/questions/32430728/python-extract-text-from-pdf-page-wise-to-list (Complex)
Please have a look at the same.
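If you want a starting point, here is a rough, untested PyPDF2 sketch along the lines of the first link; "book.pdf" and the output filename are placeholders, and note that extractText() can miss text in some PDFs:
```
import PyPDF2

reader = PyPDF2.PdfFileReader(open("book.pdf", "rb"))  # placeholder filename
writer = PyPDF2.PdfFileWriter()

# Keep every page whose extracted text contains the phrase
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    if "test your knowledge" in page.extractText().lower():
        writer.addPage(page)

with open("test_your_knowledge_pages.pdf", "wb") as out:
    writer.write(out)
```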
Thanks,
Nilaya.