
How to compare PDFs using Python

Problem: How do you compare two PDF files programmatically using Python?

Adobe makes it easy to compare the changes in two PDF files. However, as testers, we sometimes need to compare a lot of PDF files (especially reports!) against some preset baselines. In these cases, it helps to have a script that can compare PDF files and tell you if they differ in any way.

There are several options. We like DiffPDF, pdf2text and the pdf-diff Python module. Each option comes with its own set of pros and cons. Most of these solutions do a good job of comparing the text in the PDF files. However, we noticed that they are somewhat lacking when it comes to comparing graphs and charts. In this post, we show you one more approach, which is useful if you have a lot of graphs and charts in your PDF files.

WARNING: You should be using this kind of automated check as a last resort. Ask your developers for other ways to check the data/content of the PDF files before using this approach.

Steps involved

We will be using image comparison to verify if the two PDF files are identical or not. To do so, we need to:
1. Get set up with ImageMagick and Ghostscript
2. Convert each page of the PDF file into one image
3. Compare corresponding images and save the resulting difference image for every page
4. Stitch all the resulting difference images into a single PDF file
5. Use the utility to compare two PDF files

I have created a class, PDF_Image_Compare, which can be used to compare two PDF files. The class lists which pages differ and gives you overlaid difference images of the two PDF files. The steps below explain the methods and modules required to compare two PDF files.

Step 1. Get set up with ImageMagick and Ghostscript
The first step is to convert the PDF file to an image format such as JPG. We will use ImageMagick, which in turn relies on Ghostscript to read PDF files. To do this you need to:
a. Download and install ImageMagick, a software suite to create, edit, compose, or convert bitmap images.
b. Download and install Ghostscript, an interpreter for the PostScript language and for PDF, which ImageMagick needs.
c. Add both ImageMagick and Ghostscript to your PATH environment variable.
d. Verify that you are set up correctly by using the “convert” utility. Open a command prompt and run the command ‘convert file.pdf file.jpg’ to convert file.pdf into file.jpg.
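Before running a real conversion, it can help to confirm from Python which executables are actually being picked up from your PATH. The sketch below is only an illustration: locate_tools is a hypothetical helper, and the Ghostscript executable names are assumptions (commonly gs on Linux/macOS and gswin64c or gswin32c on Windows). Printing the resolved path is useful on Windows, where a system utility that is also called convert exists and may be found instead of ImageMagick.

```python
import shutil

def locate_tools(tool_names):
    """Map each tool name to the full path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in tool_names}

# 'convert' is the command used in this post. The Ghostscript names below
# are assumptions and depend on your platform and installer.
for name, path in locate_tools(["convert", "gs", "gswin64c", "gswin32c"]).items():
    print('%s -> %s' % (name, path or 'NOT FOUND'))
```

If the path printed for convert does not point into your ImageMagick installation, adjust the order of the entries in your PATH.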

Step 2. Convert each page of the PDF file into one image
We plan to use the difference() method in the ImageChops module of PIL, which returns the absolute value of the pixel-by-pixel difference between two images. However, we can’t use it directly on a PDF file, so we first have to convert the PDF into a list of images. To do so, we will call convert from Python using the subprocess module.

    def get_image_list_from_pdf(self,pdf_file):
        "Return a list of images that resulted from running convert on a given pdf"
        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
        pdf_dir = pdf_file.split(pdf_name)[0]
        jpg = pdf_file.split('.pdf')[0]+'.jpg'
        # Convert the pdf file to jpg file
        self.call_convert(pdf_file,jpg)
        #Get all the jpg files after calling convert and store it in a list
        image_list = []        
        file_list = os.listdir(pdf_dir)
        for f in file_list:
            if f[-4:]=='.jpg' and pdf_name in f:
                #Make sure the file names of both pdf are not similar
                image_list.append(f)
 
        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))
        return image_list
 
 
    def call_convert(self,src,dest):
        "Call convert to convert pdf to jpg"
        print('About to call convert on %s'%src)
        try:
            subprocess.check_call(["convert",src,dest], shell=True)
        except Exception as e:
            print('Convert exception ... could be an ImageMagick bug')
            print(e)
        print('Finished calling convert on %s'%src)
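One fragile spot in the method above is the check `pdf_name in f`, which matches on substrings, so 'Test' also picks up images that belong to 'Test2'. The sketch below shows one possible way to select and order the page images more strictly. It assumes that convert names the output 'name.jpg' for a single-page PDF and 'name-0.jpg', 'name-1.jpg', ... for a multi-page PDF. page_images_for is a hypothetical helper, not part of the class.

```python
import re

def page_images_for(pdf_name, filenames):
    """Pick the JPGs that belong to pdf_name and order them by page number."""
    # Matches 'name.jpg' (single page) or 'name-<n>.jpg' (one file per page)
    pattern = re.compile(r"^%s(?:-(\d+))?\.jpg$" % re.escape(pdf_name))
    pages = []
    for f in filenames:
        match = pattern.match(f)
        if match:
            # A file without a page suffix is treated as page 0
            page_number = int(match.group(1) or 0)
            pages.append((page_number, f))
    # Sort numerically so that page 10 comes after page 2
    return [f for _, f in sorted(pages)]

files = ["Test-10.jpg", "Test-2.jpg", "Test-0.jpg", "Test2-0.jpg", "Test-1.jpg", "notes.txt"]
print(page_images_for("Test", files))
# ['Test-0.jpg', 'Test-1.jpg', 'Test-2.jpg', 'Test-10.jpg']
```

Sorting by the numeric page suffix also keeps the pages in reading order, which a plain string sort does not do once a PDF has ten or more pages.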

Step 3. Compare corresponding images and save the resulting difference image for every page
Now we can use the ImageChops.difference() method to compare the corresponding images from the two lists of images.
The method below does this. It returns result_flag as True if all the images match and False if they do not, and prints out the image pairs that differ. Note that it opens the images from self.download_dir, which is the current working directory, so run the utility from the directory that holds both PDF files.

    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):
        "Creates the diffed images in diff image directory and generates a pdf by calling call convert"
        result_flag = True
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img
            try:
                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)
                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)
                diff = ImageChops.difference(pdf2_image,pdf1_image)
                diff.save(diff_filename)
 
                #getbbox() returns None when the difference image is entirely black, i.e. the two pages match
                if diff.getbbox() is not None:
                    result_flag = False
                    print ('The file didnt match for: \n>>%s\nand\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))
            except Exception as e:
                print('Error when trying to open image')
                print(e)
                result_flag = False
 
        return result_flag
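To make the comparison logic concrete, here is a minimal pure-Python sketch of what the two calls used above compute. It is an illustration on tiny grayscale grids, not Pillow itself: difference() mimics ImageChops.difference() by taking the absolute difference of every pixel, and bounding_box() mimics getbbox() by returning the box around all non-zero pixels, or None when there are none. Both function names are hypothetical.

```python
def difference(page_a, page_b):
    """Per-pixel absolute difference of two equally sized grayscale grids."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(page_a, page_b)]

def bounding_box(diff):
    """Return (left, upper, right, lower) around non-zero pixels, or None."""
    hits = [(x, y) for y, row in enumerate(diff)
            for x, value in enumerate(row) if value != 0]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    # Right and lower edges are exclusive, following Pillow's convention
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

baseline = [[255, 255, 255, 255],
            [255,   0,   0, 255],
            [255, 255, 255, 255]]
changed  = [[255, 255, 255, 255],
            [255,   0, 255, 255],
            [255, 255, 255, 255]]

print(bounding_box(difference(baseline, baseline)))  # None
print(bounding_box(difference(baseline, changed)))   # (2, 1, 3, 2)
```

This also explains why the difference image of two identical pages is completely black: every pixel difference is zero.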

Step 4. Stitch all the resulting difference images into a single PDF file
We found it useful to present the final difference as one PDF file rather than as a series of images. This makes it easier for the human interpreting the results to quickly identify and summarize the differences. To do so, add the code below to the end of the create_diff_image method above, after the loop and before the return statement.

        #Create a pdf out of all the jpgs created
        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'
        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)
 
        if os.path.exists(diff_pdf_name):
            print('Successfully created the difference pdf: %s'%(diff_pdf_name))

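Step 5. Use the utility to compare two PDF files
Finally, we accept the locations of the two PDF files as command line options and call get_pdf_diff(), which returns True if the two files match.
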
if __name__== '__main__':
    #Lets accept command line options for the location of two PDF files from the user 
    #We have chosen to use the Python module optparse 
    usage = "usage: %prog --f1 <pdf1> --f2 <pdf2>\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\n---"
    parser = OptionParser(usage=usage)
    parser.add_option("--f1","--pdf1",dest="pdf1",help="The location of pdf file1",default=None)
    parser.add_option("--f2","--pdf2",dest="pdf2",help="The location of pdf file2",default=None)
    (options,args) = parser.parse_args()
 
    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)
    result_flag = test_obj.get_pdf_diff()
    if result_flag == True:
        print ('The two PDF matched properly')
    else:
        print ('The PDFs didnt match properly, check the diff file generated')
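With the options above, a missing --f1 or --f2 leaves the value as None, and the class then fails when it calls pdf_file.split(). On newer Python versions, the argparse module from the standard library is a common alternative to optparse. The sketch below is one possible variant, not the version used in this post. build_parser is a hypothetical helper, and marking both options as required makes the script stop with a usage message instead of passing None along.

```python
import argparse

def build_parser():
    """Build a parser that insists on both PDF locations."""
    parser = argparse.ArgumentParser(description="Compare two PDF files page by page")
    parser.add_argument("--f1", "--pdf1", dest="pdf1", required=True,
                        help="The location of pdf file1")
    parser.add_argument("--f2", "--pdf2", dest="pdf2", required=True,
                        help="The location of pdf file2")
    return parser

# Parse an explicit list here for illustration; call parse_args() with no
# arguments in a real script to read the command line instead.
args = build_parser().parse_args(["--f1", "Sample.pdf", "--f2", "Test.pdf"])
print(args.pdf1, args.pdf2)
```

The parsed values can then be handed to the class exactly as before, for example PDF_Image_Compare(pdf1=args.pdf1, pdf2=args.pdf2).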

Putting it all together

Here is how our utility to compare two PDFs looks.

from PIL import Image, ImageChops
import os,time,subprocess,shutil
from optparse import OptionParser
 
class PDF_Image_Compare:
    "Compares two PDF files"
    def __init__(self,pdf1,pdf2):
        "Constructor: initialises pdf1 and pdf2"
        self.download_dir = os.getcwd()
        self.pdf1 = pdf1
        self.pdf2 = pdf2
 
 
    def get_image_list_from_pdf(self,pdf_file):
        "Return a list of images that resulted from running convert on a given pdf"
        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
        pdf_dir = pdf_file.split(pdf_name)[0]
        jpg = pdf_file.split('.pdf')[0]+'.jpg'
        # Convert the pdf file to jpg file
        self.call_convert(pdf_file,jpg)
        #Get all the jpg files after calling convert and store it in a list
        image_list = []        
        file_list = os.listdir(pdf_dir)
        for f in file_list:
            if f[-4:]=='.jpg' and pdf_name in f:
                #Make sure the file names of both pdf are not similar
                image_list.append(f)
 
        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))
        return image_list
 
 
    def call_convert(self,src,dest):
        "Call convert to convert pdf to jpg"
        print('About to call convert on %s'%src)
        try:
            subprocess.check_call(["convert",src,dest], shell=True)
        except Exception as e:
            print('Convert exception ... could be an ImageMagick bug')
            print(e)
        print('Finished calling convert on %s'%src)
 
 
    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):
        "Creates the diffed images in diff image directory and generates a pdf by calling call convert"
        result_flag = True
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img
            try:
                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)
                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)
                diff = ImageChops.difference(pdf2_image,pdf1_image)
                diff.save(diff_filename)
 
                #getbbox() returns None when the difference image is entirely black, i.e. the two pages match
                if diff.getbbox() is not None:
                    result_flag = False
                    print ('The file didnt match for: \n>>%s\nand\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))
            except Exception as e:
                print('Error when trying to open image')
                print(e)
                result_flag = False
        #Create a pdf out of all the jpgs created
        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'
        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)
 
        if os.path.exists(diff_pdf_name):
            print('Successfully created the difference pdf: %s'%(diff_pdf_name))
 
        return result_flag
 
 
    def cleanup(self,diff_image_dir,pdf1_list,pdf2_list):
        "Clean up all the image files created"
        print('Cleaning up all the intermediate jpg files created when comparing the pdf')
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            try:
                os.remove(self.download_dir +os.sep+ pdf1_img)
                os.remove(self.download_dir +os.sep+ pdf2_img)
            except Exception as e:
                print('Unable to delete jpg file')
                print(e)
        print('Nuking the temporary image_diff directory')
        try:
            time.sleep(5)
            shutil.rmtree(diff_image_dir)
        except Exception as e:
            print('Could not delete the image_diff directory')
            print(e)
 
 
    def get_pdf_diff(self,cleanup=True):
        "Create a difference pdf by overlaying the two pdfs and generating an image difference. Returns True if the files match, else returns False"
 
        # Initialize result_flag to False
        result_flag = False
 
        #Get the list of images using get_image_list_from_pdf which inturn calls convert on a given pdf  
        pdf1_list = self.get_image_list_from_pdf(self.pdf1)
        pdf2_list = self.get_image_list_from_pdf(self.pdf2)
 
        #If diff directory already does exist - delete it 
        #Easier to simply nuke the folder and create it again than to check if its empty
        diff_image_dir = self.download_dir + os.sep+'diff_images'
        if os.path.exists(diff_image_dir):
            print('diff_images directory exists ... about to nuke it')
            shutil.rmtree(diff_image_dir)
 
        #Create a new and empty diff directory
        os.mkdir(diff_image_dir)
        print('diff_images directory created')
        print('Total pages in pdf2: %d'%len(pdf2_list))
        print('Total pages in pdf1 : %d'%len(pdf1_list))
 
        #Verify that there are equal number pages in pdf1 and pdf2
        if len(pdf2_list)==len(pdf1_list) and len(pdf2_list) !=0:
            print('Check SUCCEEDED: There are an equal number of jpgs created from the pdf generated from pdf2 and pdf1')
            print('Total pages in images: %d'%len(pdf2_list))
            pdf1_list.sort()
            pdf2_list.sort()
 
            #Create the diffed images
            result_flag = self.create_diff_image(pdf1_list,pdf2_list,diff_image_dir)
        else:
            print('Check FAILED: There are an unequal number of jpgs created from the pdf generated from pdf2 and pdf1')
            print('Total pages in image2 : %d'%len(pdf2_list))
            print('Total pages in image1: %d'%len(pdf1_list))
            print('ERROR: Skipping image comparison between %s and %s'%(self.pdf1,self.pdf2))
 
        if cleanup:
            #Delete all the image files created
            self.cleanup(diff_image_dir,pdf1_list,pdf2_list)            
 
        return result_flag
 
if __name__== '__main__':
    #Lets accept command line options for the location of two PDF files from the user 
    #We have chosen to use the Python module optparse 
    usage = "usage: %prog --f1 <pdf1> --f2 <pdf2>\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\n---"
    parser = OptionParser(usage=usage)
    parser.add_option("--f1","--pdf1",dest="pdf1",help="The location of pdf file1",default=None)
    parser.add_option("--f2","--pdf2",dest="pdf2",help="The location of pdf file2",default=None)
    (options,args) = parser.parse_args()
 
    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)
    result_flag = test_obj.get_pdf_diff()
    if result_flag == True:
        print ('The two PDF matched properly')
    else:
        print ('The PDFs didnt match properly, check the diff file generated')


Run the utility file using command prompt

You can use the utility file any way you want. For example, you can compare two PDF files from the command prompt by passing the locations of the PDF files with the --f1 and --f2 options.
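If you need to check a whole batch of reports against their baselines, a small driver loop around the comparison is usually enough. The sketch below is one possible shape for it, under stated assumptions: find_mismatches is a hypothetical helper, and compare stands for any callable that returns True when two files match. A stand-in comparison with made-up file names is used here so that the example runs without ImageMagick.

```python
def find_mismatches(pairs, compare):
    """Run compare(baseline, candidate) on every pair; return the pairs that differ."""
    mismatches = []
    for baseline, candidate in pairs:
        try:
            matched = compare(baseline, candidate)
        except Exception as e:
            # Treat a crash as a mismatch so that it is not silently ignored
            print('Comparison failed for %s: %s' % (candidate, e))
            matched = False
        if not matched:
            mismatches.append((baseline, candidate))
    return mismatches

# Stand-in results for illustration only; the file names are made up
known_results = {("baseline_q1.pdf", "q1.pdf"): True,
                 ("baseline_q2.pdf", "q2.pdf"): False}

def fake_compare(baseline, candidate):
    return known_results[(baseline, candidate)]

print(find_mismatches(sorted(known_results), fake_compare))
# [('baseline_q2.pdf', 'q2.pdf')]
```

With the class from this post, compare could be something like lambda a, b: PDF_Image_Compare(pdf1=a, pdf2=b).get_pdf_diff(). Keep in mind that the class expects the two PDF files to have clearly different names.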

We have used this utility at a couple of our clients. We feel it fills a specific need (comparing image-heavy PDF files) pretty well. We hope this post helps you do the same.

Avinash Shetty

I am a dedicated quality assurance professional with a true passion for ensuring product quality and driving efficient testing processes. Throughout my career, I have gained extensive expertise in various testing domains, showcasing my versatility in testing diverse applications such as CRM, Web, Mobile, Database, and Machine Learning-based applications. What sets me apart is my ability to develop robust test scripts, ensure comprehensive test coverage, and efficiently report defects. With experience in managing teams and leading testing-related activities, I foster collaboration and drive efficiency within projects. Proficient in tools like Selenium, Appium, Mechanize, Requests, Postman, Runscope, Gatling, Locust, Jenkins, CircleCI, Docker, and Grafana, I stay up-to-date with the latest advancements in the field to deliver exceptional software products. Outside of work, I find joy and inspiration in sports, maintaining a balanced lifestyle.


54 thoughts on “How to compare PDFs using Python”

Hamer Basta says:

January 15, 2019 at 3:49 am

hello
i have question: is it possible to compare more than 2 PDF? And is it possible to automate it for a bunch of PDF’s

grt,

1. Rohan Dudam says:
  
  January 17, 2019 at 1:39 am
  
  Yes, it is possible to compare more than 2 PDF. But you need to keep one PDF as a base. And it is also possible to automate it for a bunch of PDF’s.
  Thanks
  
Anonymous says:

February 15, 2019 at 2:15 am

need help.
unable to generate images.

C:\workspace\python Prj\Threads>python rd.py --f1 C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf --f2 C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
About to call convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
Finished calling convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
Total of 1 jpgs produced after converting the pdf file: C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
About to call convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
Finished calling convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
Total of 1 jpgs produced after converting the pdf file: C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
diff_images directory created
Total pages in pdf2: 1
Total pages in pdf1 : 1
Check SUCCEEDED: There are an equal number of jpgs created from the pdf generated from pdf2 and pdf1
Total pages in images: 1
Inside create diff image.. with args : – ['HelloWorld.jpg'] ['HelloWorld.jpg']
Error when trying to open image
About to call convert on C:\workspace\python Prj\Threads\diff_images\*.jpg
convert: unable to open image 'C:\workspace\python Prj\Threads\diff_images\*.jpg': Invalid argument @ error/blob.c/OpenBlob/3485.
convert: no images defined `C:\workspace\python Prj\Threads\diff_HelloWorld.pdf' @ error/convert.c/ConvertImageCommand/3300.
Convert exception ... could be an ImageMagick bug
Command '['convert', 'C:\\workspace\\python Prj\\Threads\\diff_images\\*.jpg', 'C:\\workspace\\python Prj\\Threads\\diff_HelloWorld.pdf']' returned non-zero exit status 1.
Finished calling convert on C:\workspace\python Prj\Threads\diff_images\*.jpg
Cleaning up all the intermediate jpg files created when comparing the pdf
Unable to delete jpg file
[WinError 2] The system cannot find the file specified: 'C:\\workspace\\python Prj\\Threads\\HelloWorld.jpg'
Nuking the temporary image_diff directory
The PDFs didnt match properly, check the diff file generated

Mahesh yadav says:

June 25, 2019 at 4:10 am

Hi All,

Thanks for everything,
I’m facing the issues, please help me on this.
Everything is working fine . But diff PDF contains all pages as black.
Can anyone please help on this issue.

Thanks & Regrads,
Mahesh

1. Indira Nellutla says:
  
  June 25, 2019 at 10:28 pm
  
  Hi Mahesh,
  
  Could you please provide more details about the issue.
  
  Thanks
  Indira Nellutla
  
Kiran says:

September 17, 2019 at 2:41 am

Hi,
Program works very nice. Thanks for your efforts.
I am getting a black image as a result by leaving the difference in content untouched.
Instead, how can I get the result pdf by highlighting the difference inside a rectangle by retaining the other content of the pdf as it is.

Please suggest !!

Kiran says:

September 17, 2019 at 3:05 am

Currently, the result is generation of a pdf file with black images. Instead of that, can we get a result pdf by highlighting the content difference inside a rectangle. Please suggest !!

1. Smitha Rajesh says:
  
  September 20, 2019 at 12:49 am
  
  Hi Kiran,
  Can you add the rectangle by referring to this link https://pillow.readthedocs.io/en/3.1.x/reference/ImageDraw.html where the difference happens and call the method when generating the file?
  
  Regards,
  Smitha
  
  1. Kiran says:
    
    September 21, 2019 at 4:30 am
    
    Thank you Smitha,
    I am always getting the exception as “Error when trying to open image” on execution of the function
    diff = ImageChops.difference(pdf2_image,pdf1_image) .
    
    ####################
    pdf2_image = Image.open(download_dir +os.sep+ pdf2_img)
    pdf1_image = Image.open(download_dir+os.sep + pdf1_img)
    diff = ImageChops.difference(pdf2_image,pdf1_image)
    ####################
    Separate images have been created from every page of the pdf document.
    
    I also tried using Open CV.
    pdf1_image = cv2.imread(download_dir+os.sep + pdf1_img)
    pdf2_image = cv2.imread(download_dir +os.sep+ pdf2_img)
    pdf1_image = cv2.cvtColor(pdf1_image, cv2.COLOR_BGR2GRAY)
    pdf2_image = cv2.cvtColor(pdf2_image, cv2.COLOR_BGR2GRAY)
    
    cv2.imshow("pdf1_image", pdf1_image)
    cv2.imshow("pdf2_image", pdf2_image)
    cv2.waitKey(0)
    -> Able to open the image too.
    but the function (score,diff) = compare_ssim(pdf1_image, pdf2_image, full = True) doesnt work.
    Please advise.
    
    regards,
    Kiran
  2. Shivahari P says:
    
    September 26, 2019 at 2:14 am
    
    Hi,
    The Error when trying to open image error is very generic. Try printing the error in the except block for create_diff_image method using print(str(e)) statement instead of print('Error when trying to open image').
    Also what does the function compare_ssim do?
Richard Carmona says:

January 31, 2020 at 4:41 pm

Hello I am trying to run this but my current issue is that PythonMagick is included as a package in the source code but I am unable to install it, I am using Python 3.7, does that matter? Thank you.

I have also installed ImageMagick and GhostScript

1. Indira Nellutla says:
  
  February 4, 2020 at 7:25 am
  
  Hi Richard,
  
  I hope you have downloaded the correct version of WHL from below url – https://www.lfd.uci.edu/~gohlke/pythonlibs/#pythonmagick
  Could you please elaborate the issue you are facing when installing?
  
  1. Swati says:
    
    July 23, 2020 at 6:53 am
    
    Hi Indira,
    I have used the PythonMagic.whl from the url given by you. I have installed wheel via pip and when I am installing PythonMagic via pip by going on the path where PythonMagic.whl files is downloaded. In CLI , it is giving me below error:
    ERROR: PythonMagick-0.9.19-cp38-cp38-win32.whl is not a supported wheel on this platform.
  2. Smitha Rajesh says:
    
    July 24, 2020 at 1:07 am
    
    Hi Swati,
    
    Can you please share the exact command that you used to install PythonMagick?
  3. Nilaya Indurkar says:
    
    July 24, 2020 at 1:35 am
    
    Hi Swati,
    
    Please check if this link helps you to solve the issue.
    
    https://stackoverflow.com/questions/41353360/unable-to-install-pythonmagick-on-windows-10
    
    Thanks,
    Nilaya
Ramesh says:

September 24, 2020 at 2:49 am

When i am trying to execute i am getting the below error. Can you please help me solve it.

python pdf_files_comparisions.py --f1 "U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf" --f2 "U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf"
About to call convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf
Invalid Parameter - U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.jpg
Convert exception ... could be an ImageMagick bug
Finished calling convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf
Total of 0 jpgs produced after converting the pdf file: U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf
About to call convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
Invalid Parameter - U:\Datagaps\image_Differences\pdfs\Trellis1_Post.jpg
Convert exception ... could be an ImageMagick bug
Finished calling convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
Total of 0 jpgs produced after converting the pdf file: U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
diff_images directory created
Total pages in pdf2: 0
Total pages in pdf1 : 0
Check FAILED: There are an unequal number of jpgs created from the pdf generated from pdf2 and pdf1
Total pages in image2 : 0
Total pages in image1: 0
ERROR: Skipping image comparison between U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf and U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
Cleaning up all the intermediate jpg files created when comparing the pdf
Nuking the temporary image_diff directory
Traceback (most recent call last):
  File "pdf_files_comparisions.py", line 150, in <module>
    result_flag = test_obj.get_pdf_diff()
  File "pdf_files_comparisions.py", line 138, in get_pdf_diff
    return result_flag
UnboundLocalError: local variable 'result_flag' referenced before assignment.

1. Rahul Bhave says:
  
  September 25, 2020 at 3:44 am
  
  Hi Ramesh,
  I think the problem here is ImageMagick has failed to convert the PDF into the JPEG file.You can refer below the line from the error log you have given.
  `Convert exception … could be an ImageMagick bug`
  Once this issue is resolved you may be able to proceed with code.I would suggest you to check the ImageMagick site for similar problems and their solutions. One such thread I noticed is as below:
  `www.imagemagick.org/discourse-server/viewtopic.php?t=35171`
  Also, you can also try with different ImageMagick version to check if the problem persists there as well.
  
  Regards,
  Rahul
  
2. Rahul says:
  
  September 5, 2021 at 10:19 am
  
  Hey Ramesh,
  
  Are you able to resolve the above issue. I am seeing similar issue and did not find any resolution.
  
Anonymous says:

November 1, 2020 at 1:07 pm

Hi , Could you please help with the compatible versions of python + ImageMagick + GhostScript to run this code on Windows 64-bit machine. Bear with me as new to python

1. Rohini Gopal says:
  
  November 2, 2020 at 6:45 am
  
  hi,
  
  Based on the pixels of the image, you can select ImageMagick from http://www.imagemagick.org/script/download.php#windows. And GhostScript can be downloaded from https://www.ghostscript.com/download/gsdnld.html
  Regarding the Python version, the blog was written with python2.7 but it would work fine with newer versions of python too.
  
  Regards,
  Rohini
  
Vijay DR says:

July 28, 2021 at 12:09 am

ERROR
line 16, in get_image_list_from_pdf
pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
AttributeError: 'NoneType' object has no attribute 'split'

1. Shivahari P says:
  
  July 30, 2021 at 1:10 am
  
  Hi,
  is the pdf_file variable set?
  This is what the step does:
  >>> pdf_name = 'mypdf.pdf'.split(os.sep)[-1].split('.pdf')[0]
  >>> pdf_name
  'mypdf'
  
  Thanks
  
  1. dhruv says:
    
    December 14, 2021 at 10:23 am
    
    hello can some one tell me the right way to pass the pdf location in usage arguments.
    usage = "usage: %prog --f1 'C:\\Users\dhruv\Downloads\samp1.pdf' --f2 'C:\\Users\dhruv\Downloads\samp2.pdf'"
    parser = OptionParser(usage=usage)
    
    I tried this but it’s not working and I’m getting the same error as
    AttributeError: 'NoneType' object has no attribute 'split'
  2. Annapoorani Gurusamy says:
    
    December 17, 2021 at 1:37 am
    
    Hi Dhruv,
    I can see from your comments in the given path there is a double slash after ‘C:’. Have you tried giving a single slash and checked running the script? If not please try with a valid path.
    
    Regards
    Annapoorani
dhruv says:

December 17, 2021 at 9:45 am

I did try with a single slash as well. But it didn’t work. I have tried valid paths and everything but to no avail.

1. Rahul Bhave says:
  
  December 17, 2021 at 12:32 pm
  
  Hi Dhruv,
  
  Is the pdf_file variable set properly in your code? Are you able to get pdf_name correctly after running following step:
  
  pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
  
  You can print the output after this step or use IPython command shell.
  
  Regards,
  Rahul
  