When testing highly data dependent products, I find it very useful to use data published by governments. When government organizations publish data online, barring a few notable exceptions, it usually releases it as a series of PDFs. The PDF file format was not designed to hold structured data, which makes extracting data from PDFs difficult. In this post, I will show you a couple of ways to extract text and table data from PDF file using Python and write it into a CSV or Excel file.
We will take an example of US census data for the Hispanic Population for 2010. If you look at the content of the PDF, you can see that there is a lot of text data, table data, graphs, maps etc. I will extract the table data for Hispanic or Latino Origin Population by Type: 2000 and 2010 from Page 3 of the PDF file.
For achieving this, I first tried using PyPDF2 (for extracting) and PDFtables (for converting PDF tables to Excel/CSV). It did serve my requirement but PDFtables.com is paid service.
Later I came across PDFMiner and started exploring it for extracting data using its pdf2txt.py script. I liked this solution much better and I am using it for my work.
Method 1: Extract the Pages with Tables using PyPDF2 and PDFTables
When I Googled around for ‘Python read pdf’, PyPDF2 was the first tool I stumbled upon. PyPDF2 can extract data from PDF files and manipulate existing PDFs to produce a new file. After spending a little time with it, I realized PyPDF2 does not have a way to extract images, charts, or other media from PDF documents. But it can extract text and return it as a Python string. Reading a PDF document is pretty simple and straight forward. I used PdfFileReader() and PdfFileWriter() classes for reading and writing the table data.
import PyPDF2 PDFfilename = "hispanic.pdf" #filename of your PDF/directory where your PDF is stored pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object |
Firstly, I installed PyPDF2 library and imported it, created an instance of the PdfFileReader Class, which stores information about the PDF (number of pages, text on pages, etc). In this PDF, the table which I need extract is in Page 3. To extract this page, I used below code:-
pg3 = pfr.getPage(2) #extract pg 2 writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object #add pages writer.addPage(pg3) #filename of your PDF/directory where you want your new PDF to be NewPDFfilename = "hispanic_tables.pdf" with open(NewPDFfilename, "wb") as outputStream: #create new PDF writer.write(outputStream) #write pages to new PDF |
I used the .getPage() method, with the page number + 1 as the parameter (pages start at 0), on PdfFileReader object. After that, I created a PdfFileWriter object, which will eventually write a new PDF and add the pages to it. The purpose of writing this page with tables into separate pdf file is that I used PDFTables for extracting data. PDFTables puts everything (not just tables) in the PDF document into the output Excel or CSV, to avoid having a lot of junk data in the Excel I created a separate PDF with just the table that I want to extract.
PyPDF2 library extracts the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren’t useful and look confusing (for instance, lots of numbers mashed together)
Writing the Table Data to a Excel using PDFTables
Now that I have a PDF with all of the table data that I need, I can now use PDFTables to write the table data to an Excel/CSV file. The PDFTables package extracts tables from PDF files and allows the user to convert PDF tables to formats (CSV, XLM, or XLSX). It provides us with an API key using which we can post a request to the PDFTables website to get the table extraction. You can get an API key by creating an account on the site for a free trial (PDFtables.com is paid, getting an API Key is restricted to certain pages only). With this free trial, I was able to upload this pdf and write the response to an excel. This served my purpose, but since PDFTables.com is paid I moved on exploring other tools for data extraction.
Method 2: PDFMiner for extracting text data from PDFs
I came across a great Python-based solution to extract the text from a PDF is PDFMiner. PDFMiner has two command-line scripts namely pdf2txt.py (to extract text and images) and dumpdf.py (find objects and their coordinates). I used pdf2txt.py script to extract the pdf content to HTML format using below command.
pdf2txt.py -O myoutput -o myoutput/hispanic.html -t html -p 3 hispanic.pdf |
Below is list of options which can be used with pdf2txt.py
Options:
- -o output file name
- -p comma-separated list of page numbers to extract
- -t output format (text/html/xml/tag[for Tagged PDFs])
- -O dirname (triggers extraction of images from PDF into directory)
- -P password
The above command can be used to convert a PDF to HTML or XML. After installing PDFMiner, cd into the directory where the PDF file is located and ran the above command. The resulting file will be ‘hispanic.html’ which has the 3rd page from the PDF. Reading data from HTML can be done using Beautiful Soup. It is a powerful Python library for extracting data from XML and HTML files. I used BeautifulSoup for reading and extracting the data from hispanic.html. You can refer to my previous post on Data scraping using python for extracting table data from html and writing into a csv file. I wrote a quick script that will extract table data from web page using Wikipedia module and BeautifulSoup.
In this way, I used PDFMiner and PyPDF2 to extract the data, but you’ll still have to make a choice when deciding which to use and learn. Both libraries are in active development and the developers are dedicated to providing good code. There are several tools you can use to get what you need from them, and Python enables to get inside and scrape, split, merge, delete, and crop just about whatever you find.
In this post, I tried to showcase different approaches with few code snippets which I implemented in our requirement for extracting table data from PDF file by providing. I hope you will like it!
If you are a startup finding it hard to hire technical QA engineers, learn more about Qxf2 Services.
References:-
1) Manipulating PDFs with python and PyPDF2
2) Working with pdf file in python
3) Different PDF tools to extract text and data from pdfs
I am an experienced engineer who has worked with top IT firms in India, gaining valuable expertise in software development and testing. My journey in QA began at Dell, where I focused on the manufacturing domain. This experience provided me with a strong foundation in quality assurance practices and processes.
I joined Qxf2 in 2016, where I continued to refine my skills, enhancing my proficiency in Python. I also expanded my skill set to include JavaScript, gaining hands-on experience and even build frameworks from scratch using TestCafe. Throughout my journey at Qxf2, I have had the opportunity to work on diverse technologies and platforms which includes working on powerful data validation framework like Great Expectations, AI tools like Whisper AI, and developed expertise in various web scraping techniques. I recently started exploring Rust. I enjoy working with variety of tools and sharing my experiences through blogging.
My interests are vegetable gardening using organic methods, listening to music and reading books.
Hi Indira/Team,
I want a extract each word from a pdf with it’s coordinates. Currently I am using LTChar which gives coordinates of each character. I am also able to get the coordinates of the whole line using LTTextLine. But I need to for the words. Is this doable using PDFMiner?
Please guide.
Thanks in advance.,
Shikha Thakur
Hi Shikha,
We are not sure how to get each word with its coordinates. One idea which we had though is since you are using LTChar to get characters, you can simply put the characters in a list, join them all and then split based on whitespace?
Thanks & Regards
Avinash Shetty
PYPDF2 gives text only searchable pdf files. I have lot of pdf’s those are non searchable pdf’s. how can i extract text from that pdf’s
Hi,
Though our experience is pretty poor using OCR (optical character recognition) tool, you probably need to try Tesseract OCR for converting a nonsearchable PDF into a searchable PDF and then extract the text. You can refer to the below links for more details.
1.https://pypi.org/project/pypdfocr/
2.https://github.com/virantha/pypdfocr
Hi,
I have a pdf for payroll of employees, it contains data as shown below:
FOLHA DE PAGAMENTO ANALÍTICA
Empresa : GOOD JOB SEG E VIG PATRIMONIAL LTDA ( 00007 ) Página : 00001
End. : ALAMEDA DOS PIRATINIS, 704 CNPJ/CEI: 10.336.666/0001-79
Ref.: 01/05/2018 a 31/05/2018 Dpto : CONDOMINIO EDIFICIO INTERNATIONAL TRADE CENTER RES
OS VALORES DE FÉRIAS E RESCISÃO JÁ FORAM PAGOS.
Código Nome Ref. Sal. Contratual Adicionais Descontos Líquido Recibo
010017 DULCINEA SILVA TEIXEIRA NUNES 1.487,00 Função : VIGILANTE Livro: 0000 Folha. : 000
Admissão : 21/09/2011 Dep IR : 0 Dep SF : 0
001 SALÁRIO BASE 220:00 1.487,00
103 HORA EXTRA 60% 001:00 10,81
035 AD. NOTURNO 20% 112:00 151,40
020 AD. PERICULOSIDADE 446,10
420 REPOUSO REMUNERADO- DSR 38,93
999 ARREDOND. DO MES 0,60
654 DESC. CONVENIO MEDICO 5% 74,35
644 DESC. 6% VALE TRANSP. 89,22
661 DESC. 18% VALE REF. 40,72
610 ARREDOND. DO M/A 0,60
617 CONTR. ASSISTENCIAL 1% 14,87
903 INSS FOLHA 192,08
Resumo do Líquido 2.134,84 411,84 *******1*.*7*2*3*,*0**0
(+) Folha Analítica ………………….: 1.723,00
(-) Adiantamento……………………..: 0,00
(-) Férias ……………………………….: 0,00
(-) Rescisão …………………………..: 0,00
(-) 13o Salário ………………………..: 0,00
____/____/______
(=) Total Líqüido ……………………..: ********************1*.*7*2**3*,*0*0*
Base INSS 2.134,24 Base FGTS 2.134,24 FGTS 170,73 Base IRRF 2.134,24
this is not in tabular format but I need to extract this as table.
Could anyone please help me…
Hi,
I don’t think we can extract/convert non tabular data of PDF in the format of table data directly.You can try text extract from PDF and save it to CSV , manually format the CSV data and create data table from CSV data using respective some module.
Hi,
I have a requirement that I have multiple PDF files. I want to search all PDF files based on some keyword and return the name of PDF file in which that keyword exists.
Hi Naveen,
I understood your requirements. To achieve that you need to convert the pdf to text and from text to the list of words. And need to search the keyword you are looking for.
PyPDF2 help you to convert a pdf file into text and refer the following link to convert the text to keywords. https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
Thanks
Converting PDF to excel
Hello all
I have PDF files all are in tables content, and on each page there is a page header, I want to extract all data into excel and trying to keep the PDF format. I used PyPDF2.py but its only convert into text.
any ideas.
Hi
Convert the pages that you want in the
PDF
file to aHTML
file and follow the instructions here – https://qxf2.com/blog/web-scraping-using-python/ to convert it to Pandas DataFrame. The Pandas DataFrame can then be converted to excel.It looks a very nice code and I know that it will work, but I stopped to apply it because the pdf that I want to read it is with password, how would you read that?
Hi,
Did you try
pdf2txt.py -O dirname -o output-file-name new-filename -t html -p page-no -P password pdf-name
?hi,
I have a pdf report,i need to programatically extract the content into another pdf containing the heading specified by user.
Is there a way this could be done?
Hii, we haven’t tried that but you may take a look at this https://dzone.com/articles/splitting-and-merging-pdfs-with-python that may help you to solve the problem
Hello Indira, I have pdf file that needs to be convert into a csv/xl/xml…
once convert this to csv, xlsx, or xml.
The format is weird, as the merged cell values are not aligned correctly (leaving no way to ffill/bfill)
pdf file looks like: https://i.stack.imgur.com/joCvn.png
Hi Mithin,
Did you try https://pdftables.com/. I tried a convert a simple pdf file with merged cells and converted it to xls format. I found it to have a better format than what pdf2txt gave me.
Hi Indira,
I have to extract to Images from PDF files and PDF files are made up of only image files(TIFF).
Can you suggest open source tools for that? I tried using PDF2image but it didn’t work.
Hi,
We have never tried to extract images from a PDF before, but a quick search helped identify PyMuPDF module.
Pls refer the snippet to extract the image using PyMuPDF here – https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python.
Let us know if this module was useful. We will decide to explore this module based on your comment.
Hi Indira,
I have invoices in pdf format for the company I work in and need certain data from those pdf file. Can you suggest where I can start from. I am able to extract the data from pdf to dataframe but I am not able to collect the specific data I want as it contains both a table and non tabular data.
Hi,
Did you try PDFTables for extracting the tables from the pdf? For extracting non tabular data you can use PDFMiner. My suggestion is that based on your requirement, you may use both these modules and extract table and non tabular data.