How to compare PDFs using Python

Problem: How do you compare two PDF files programmatically using Python?

Adobe makes it easy to compare the changes in two PDF files. However, as testers, we sometimes need to compare a lot of PDF files (especially reports!) against some preset baselines. In these cases, it helps to have a script that can compare PDF files and tell you if they differ in any way.

There are several options. We like DiffPDF, pdf2text, and the pdf-diff Python module. Each option comes with its own set of pros and cons. Most of these solutions do a good job of comparing the text in the PDF files. However, we noticed that they are somewhat lacking when it comes to comparing graphs and charts. In this post, we show you one more approach, which is useful if you have a lot of graphs and charts in your PDF files.

WARNING: You should be using this kind of automated check as a last resort. Ask your developers for other ways to check the data/content of the PDF files before using this approach.


Steps involved

We will be using image comparison to verify if the two PDF files are identical or not. To do so, we need to:
1. Get set up with ImageMagick and Ghostscript
2. Convert each page of the PDF file into one image
3. Compare corresponding images and save the resulting difference image for every page
4. Stitch all the resulting difference images into a single PDF file
5. Use the utility to compare two PDF files

I have created a class, PDF_Image_Compare, which can be used to compare two PDFs. The class will help you compare two PDF files, list which pages differ, and give you overlaid images of the two PDF files. The steps below explain the methods and modules required to compare two PDF files.


Step 1. Get set up with ImageMagick and Ghostscript
The first step is to convert the PDF file to a different format like jpg. We will use ImageMagick, which in turn uses Ghostscript. To do this you need to:
a. Download and install ImageMagick, which is a software suite to create, edit, compose, or convert bitmap images.
b. Download and install Ghostscript, which ImageMagick needs. Ghostscript is an interpreter for the PostScript language and for PDF.
c. Add both ImageMagick and Ghostscript to your PATH environment variable.
d. Verify you are set up correctly by using the "convert" utility. Open a command prompt and run the command 'convert file.pdf file.jpg' to convert file.pdf into file.jpg.
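
Before running the conversion, it can help to confirm from Python which executables are actually visible on your PATH. The sketch below is only an illustration: find_tools is a hypothetical helper, and the candidate names are assumptions (magick is the entry point in ImageMagick 7, gs is the usual Ghostscript name, and gswin64c/gswin32c are the Ghostscript console executables on Windows). On Windows, the name convert can also resolve to an unrelated system utility, so it is worth checking the path that is printed.

```python
import shutil


def find_tools(candidates=("magick", "convert", "gs", "gswin64c", "gswin32c")):
    """Map each candidate executable name to its resolved path, or None if absent."""
    # shutil.which searches the PATH the same way the command prompt would.
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print("%-10s -> %s" % (name, path or "not found on PATH"))
```

If convert points somewhere other than your ImageMagick install directory, adjust your PATH order before moving on to Step 2.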


Step 2. Convert each page of the PDF file into one image
We plan to use the difference method in the ImageChops module, which returns the absolute value of the pixel-by-pixel difference between the two images. However, we can't use it directly on a PDF file, so we first have to convert the PDF into a list of images. To do so, we will call convert from Python using the subprocess module.

    def get_image_list_from_pdf(self,pdf_file):
        "Return a list of images that resulted from running convert on a given pdf"
        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
        pdf_dir = pdf_file.split(pdf_name)[0]
        jpg = pdf_file.split('.pdf')[0]+'.jpg'
        # Convert the pdf file to jpg file
        self.call_convert(pdf_file,jpg)
        #Get all the jpg files after calling convert and store it in a list
        image_list = []        
        file_list = os.listdir(pdf_dir)
        for f in file_list:
            if f[-4:]=='.jpg' and pdf_name in f:
                #Make sure the file names of both pdf are not similar
                image_list.append(f)
 
        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))
        return image_list
 
 
    def call_convert(self,src,dest):
        "Call convert to convert pdf to jpg"
        print('About to call convert on %s'%src)
        try:
            subprocess.check_call(["convert",src,dest], shell=True)
        except Exception as e:
            print('Convert exception ... could be an ImageMagick bug')
            print(e)
        print('Finished calling convert on %s'%src)
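
The method above matches generated files with a substring test and returns bare file names. As a possible alternative, here is a minimal sketch that returns full paths in page order. It assumes convert names multi-page output as name-0.jpg, name-1.jpg and so on, and single-page output as name.jpg. list_page_images is a hypothetical helper, not part of the class.

```python
import os
import re


def list_page_images(pdf_file):
    """Return full paths of the JPEGs generated for pdf_file, in page order."""
    pdf_dir, pdf_base = os.path.split(os.path.abspath(pdf_file))
    stem = os.path.splitext(pdf_base)[0]
    # Accept 'stem.jpg' (single page) or 'stem-<n>.jpg' (one file per page).
    # Assumption: this is how convert names its output files.
    pattern = re.compile(r"^%s(?:-(\d+))?\.jpg$" % re.escape(stem))
    pages = []
    for name in os.listdir(pdf_dir):
        match = pattern.match(name)
        if match:
            page = int(match.group(1) or 0)
            pages.append((page, os.path.join(pdf_dir, name)))
    # Sort numerically so page 10 comes after page 2, not after page 1.
    return [path for page, path in sorted(pages)]
```

Matching the exact stem avoids picking up images from another PDF whose name merely contains the same text, and returning full paths means later steps do not need to guess which directory the images live in.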

Step 3. Compare corresponding images and save the resulting difference image for every page
Now we can use the ImageChops.difference() method to compare the images from the list of images created.
The method below does this. It returns result_flag as True if all the images match and False if they do not. It also prints the image pairs that differ.

    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):
        "Creates the diffed images in diff image directory and generates a pdf by calling call convert"
        result_flag = True
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img
            try:
                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)
                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)
                diff = ImageChops.difference(pdf2_image,pdf1_image)
                diff.save(diff_filename)
 
                if (ImageChops.difference(pdf2_image,pdf1_image).getbbox() is None):
                    result_flag = result_flag & True
                else:
                    result_flag = result_flag & False
                    print ('The file didnt match for: \n>>%s\nand\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))
            except Exception as e:
                print('Error when trying to open image')
                print(e)
                result_flag = result_flag & False
 
        return result_flag
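
To see what the comparison does in isolation, here is a small sketch that works on in-memory images instead of files. compare_pages is a hypothetical helper. It assumes Pillow is installed and converts both images to RGB first so that the two modes agree before the difference is taken.

```python
from PIL import Image, ImageChops


def compare_pages(image_a, image_b):
    """Return (match, bbox); bbox is the differing region, or None if there is none to report."""
    a = image_a.convert("RGB")
    b = image_b.convert("RGB")
    if a.size != b.size:
        # Pages with different dimensions cannot be pixel-identical.
        return False, None
    # difference() is black wherever the pages agree, so getbbox() returns
    # None for identical pages and the bounding box of the changes otherwise.
    bbox = ImageChops.difference(a, b).getbbox()
    return bbox is None, bbox


if __name__ == "__main__":
    page = Image.new("RGB", (4, 4), "white")
    changed = page.copy()
    changed.putpixel((1, 1), (0, 0, 0))
    print(compare_pages(page, page))
    print(compare_pages(page, changed))
```

Because matching pixels come out black, a difference image that is mostly black simply means most of the two pages agree.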

Step 4. Stitch all the resulting difference images into a single PDF file
We found it useful to present the final difference as one PDF file rather than as a series of images. This makes it easier for the human interpreting the results to quickly identify and summarize the differences. To do so, add the code below to the end of the create_diff_image method above, just before the return statement.

        #Create a pdf out of all the jpgs created
        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'
        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)
 
        if os.path.exists(diff_pdf_name):
            print('Successfully created the difference pdf: %s'%(diff_pdf_name))
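
If calling convert with a wildcard gives you trouble, one alternative is to let Pillow write the multi-page PDF. Note that this swaps in a different technique from the one used in this post. images_to_pdf is a hypothetical helper, and the sketch assumes a Pillow version with PDF save support.

```python
import io

from PIL import Image


def images_to_pdf(images, target):
    """Write a list of PIL images to target (path or file object) as one multi-page PDF."""
    if not images:
        raise ValueError("no images to stitch")
    # PDF output needs a mode such as RGB, so normalise every page first.
    pages = [img.convert("RGB") for img in images]
    # save_all plus append_images writes the remaining pages after the first one.
    pages[0].save(target, format="PDF", save_all=True, append_images=pages[1:])


if __name__ == "__main__":
    buffer = io.BytesIO()
    images_to_pdf([Image.new("RGB", (10, 10), "white"),
                   Image.new("RGB", (10, 10), "black")], buffer)
    print("Wrote %d bytes" % len(buffer.getvalue()))
```

With this approach the page order is whatever order you pass the images in, rather than the order in which a wildcard happens to expand.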

Step 5. Use the utility to compare two PDF files

 
if __name__== '__main__':
    #Lets accept command line options for the location of two PDF files from the user 
    #We have chosen to use the Python module optparse 
    usage = "usage: %prog --f1 <pdf1> --f2 <pdf2>\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\n---"
    parser = OptionParser(usage=usage)
    parser.add_option("--f1","--pdf1",dest="pdf1",help="The location of pdf file1",default=None)
    parser.add_option("--f2","--pdf2",dest="pdf2",help="The location of pdf file2",default=None)
    (options,args) = parser.parse_args()
 
    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)
    result_flag = test_obj.get_pdf_diff()
    if result_flag == True:
        print ('The two PDF matched properly')
    else:
        print ('The PDFs didnt match properly, check the diff file generated')
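
The same command line options can also be expressed with argparse, the standard-library module generally suggested for new code. This is only a sketch: build_parser is a hypothetical helper, and marking both options as required is our assumption, not something the script above enforces.

```python
import argparse


def build_parser():
    """Build a parser that accepts the same --f1/--pdf1 and --f2/--pdf2 options."""
    parser = argparse.ArgumentParser(description="Compare two PDF files page by page")
    parser.add_argument("--f1", "--pdf1", dest="pdf1", required=True,
                        help="The location of pdf file1")
    parser.add_argument("--f2", "--pdf2", dest="pdf2", required=True,
                        help="The location of pdf file2")
    return parser


if __name__ == "__main__":
    # Parse a sample argument list instead of sys.argv so the sketch runs as-is.
    args = build_parser().parse_args(["--f1", "Sample.pdf", "--f2", "Test.pdf"])
    print(args.pdf1, args.pdf2)
```

Making the options required means a missing path is reported up front instead of surfacing later as an error inside the comparison.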

Putting it all together

Here is how our utility to compare two PDFs looks.

from PIL import Image, ImageChops
import os,time,subprocess,shutil
import PythonMagick #Note: none of the methods below call PythonMagick directly
from optparse import OptionParser
 
class PDF_Image_Compare:
    "Compares two pdf files"
    def __init__(self,pdf1,pdf2):
        "Constructor: Initialises file1 and file 2"
        self.download_dir = os.getcwd()
        self.pdf1 = pdf1
        self.pdf2 = pdf2
 
 
    def get_image_list_from_pdf(self,pdf_file):
        "Return a list of images that resulted from running convert on a given pdf"
        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]
        pdf_dir = pdf_file.split(pdf_name)[0]
        jpg = pdf_file.split('.pdf')[0]+'.jpg'
        # Convert the pdf file to jpg file
        self.call_convert(pdf_file,jpg)
        #Get all the jpg files after calling convert and store it in a list
        image_list = []        
        file_list = os.listdir(pdf_dir)
        for f in file_list:
            if f[-4:]=='.jpg' and pdf_name in f:
                #Make sure the file names of both pdf are not similar
                image_list.append(f)
 
        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))
        return image_list
 
 
    def call_convert(self,src,dest):
        "Call convert to convert pdf to jpg"
        print('About to call convert on %s'%src)
        try:
            subprocess.check_call(["convert",src,dest], shell=True)
        except Exception as e:
            print('Convert exception ... could be an ImageMagick bug')
            print(e)
        print('Finished calling convert on %s'%src)
 
 
    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):
        "Creates the diffed images in diff image directory and generates a pdf by calling call convert"
        result_flag = True
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img
            try:
                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)
                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)
                diff = ImageChops.difference(pdf2_image,pdf1_image)
                diff.save(diff_filename)
 
                if (ImageChops.difference(pdf2_image,pdf1_image).getbbox() is None):
                    result_flag = result_flag & True
                else:
                    result_flag = result_flag & False
                    print ('The file didnt match for: \n>>%s\nand\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))
            except Exception as e:
                print('Error when trying to open image')
                print(e)
                result_flag = result_flag & False
        #Create a pdf out of all the jpgs created
        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'
        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)
 
        if os.path.exists(diff_pdf_name):
            print('Successfully created the difference pdf: %s'%(diff_pdf_name))
 
        return result_flag
 
 
    def cleanup(self,diff_image_dir,pdf1_list,pdf2_list):
        "Clean up all the image files created"
        print('Cleaning up all the intermediate jpg files created when comparing the pdf')
        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):
            try:
                os.remove(self.download_dir +os.sep+ pdf1_img)
                os.remove(self.download_dir +os.sep+ pdf2_img)
            except Exception as e:
                print('Unable to delete jpg file')
                print(e)
        print('Nuking the temporary image_diff directory')
        try:
            time.sleep(5)
            shutil.rmtree(diff_image_dir)
        except Exception as e:
            print('Could not delete the image_diff directory')
            print(e)
 
 
    def get_pdf_diff(self,cleanup=True):
        "Create a difference pdf by overlaying the two pdfs and generating an image difference. Returns True if the files match, else returns False"
 
        #Get the list of images using get_image_list_from_pdf which in turn calls convert on a given pdf
        pdf1_list = self.get_image_list_from_pdf(self.pdf1)
        pdf2_list = self.get_image_list_from_pdf(self.pdf2)
 
        #If diff directory already does exist - delete it 
        #Easier to simply nuke the folder and create it again than to check if its empty
        diff_image_dir = self.download_dir + os.sep+'diff_images'
        if os.path.exists(diff_image_dir):
            print('diff_images directory exists ... about to nuke it')
            shutil.rmtree(diff_image_dir)
 
        #Create a new and empty diff directory
        os.mkdir(diff_image_dir)
        print('diff_images directory created')
        print('Total pages in pdf2: %d'%len(pdf2_list))
        print('Total pages in pdf1 : %d'%len(pdf1_list))
 
        #Verify that there are equal number pages in pdf1 and pdf2
        if len(pdf2_list)==len(pdf1_list) and len(pdf2_list) !=0:
            print('Check SUCCEEDED: There are an equal number of jpgs created from the pdf generated from pdf2 and pdf1')
            print('Total pages in images: %d'%len(pdf2_list))
            pdf1_list.sort()
            pdf2_list.sort()
 
            #Create the diffed images
            result_flag = self.create_diff_image(pdf1_list,pdf2_list,diff_image_dir)
        else:
            print('Check FAILED: There are an unequal number of jpgs created from the pdf generated from pdf2 and pdf1')
            print('Total pages in image2 : %d'%len(pdf2_list))
            print('Total pages in image1: %d'%len(pdf1_list))
            print('ERROR: Skipping image comparison between %s and %s'%(self.pdf1,self.pdf2))
            #Nothing was compared, so report a mismatch instead of leaving result_flag undefined
            result_flag = False
 
        if cleanup:
            #Delete all the image files created
            self.cleanup(diff_image_dir,pdf1_list,pdf2_list)            
 
        return result_flag
 
if __name__== '__main__':
    #Lets accept command line options for the location of two PDF files from the user 
    #We have chosen to use the Python module optparse 
    usage = "usage: %prog --f1 <pdf1> --f2 <pdf2>\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\n---"
    parser = OptionParser(usage=usage)
    parser.add_option("--f1","--pdf1",dest="pdf1",help="The location of pdf file1",default=None)
    parser.add_option("--f2","--pdf2",dest="pdf2",help="The location of pdf file2",default=None)
    (options,args) = parser.parse_args()
 
    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)
    result_flag = test_obj.get_pdf_diff()
    if result_flag == True:
        print ('The two PDF matched properly')
    else:
        print ('The PDFs didnt match properly, check the diff file generated')

Run the utility file using command prompt

You can use the utility file any way you want. The screenshot below shows how to compare two PDF files from the command prompt by passing the location of the PDF files.

[Screenshot: Running the Test]
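
Since the motivation for this utility is checking many reports against preset baselines, here is a minimal sketch of how the pairs could be collected. It assumes each generated PDF has the same file name as its baseline and lives in a separate directory. pair_with_baselines is a hypothetical helper, not part of the class.

```python
import os


def pair_with_baselines(baseline_dir, candidate_dir):
    """Return (pairs, missing): matched PDF paths and baseline names with no candidate."""
    pairs, missing = [], []
    for name in sorted(os.listdir(baseline_dir)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        candidate = os.path.join(candidate_dir, name)
        if os.path.isfile(candidate):
            pairs.append((os.path.join(baseline_dir, name), candidate))
        else:
            missing.append(name)
    return pairs, missing
```

Each pair can then be handed to PDF_Image_Compare(pdf1=baseline, pdf2=candidate).get_pdf_diff() in a loop, with a baseline that has no counterpart treated as a failure.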


We have used this utility at a couple of our clients. We feel it fills a specific need (comparing image-heavy PDF files) pretty well. We hope this post helps you do the same.

Avinash Shetty

I am a software tester with over 14 years of experience in software testing. Currently, I am working at Qxf2 Services Bangalore. As a student of the context-driven approach to software testing, I feel there is a lot to learn out there which keeps me very excited. My work has helped me gain good experience in different areas of testing like CRM, Web, Mobile and Database testing. I have good knowledge of building test scripts using Automation tools like Selenium and Appium using Java and Python. Besides testing, I am a “Sports Fanatic” and love watching and playing sports.

41 Comments

  1. Anonymous said:

    Hi.
    Thanks a lot for your posts!! They are really helpful and helps keep us updated too..
    Wrt to the pdf use case, for me, the pdf gets opened in the browser. How do i read the pdf text from that? I am seeing lot of solutions using pdfminer/slate etc. But they dont seem to be appealing.. Do you have any suggestions on how i can search for a text without downloading the file?
    Thanks in advance for your time

    May 2, 2016
    Reply
    • @anonymous, sorry – I don’t know of a better way. As you must have realized, PDFs are hard for GUI automation to deal with. It becomes even more annoying when you need to interact with the PDF via a browser plugin. But if you do solve this problem, please do let us know!

      May 16, 2016
      Reply
  2. vinaya said:

    Hi
    Thanks for the post.. I tried to use pdf2text and got an error saying “ImportError: cannot import name process_pdf”. do you have any idea what i can do to overcome this?
    If you have any sample implementation using pdf2text, it would be helpful too
    thanks

    May 3, 2016
    Reply
  3. selva said:

    Hi, excellent work. Can you please teach us how we can compare images from two folders using Python?
    E.g. file1 will have image1, image2, image3, image4, image5 and file2 will have image1, image2, image3, image4, image5. I want to compare the images in file1 with the images in file2 one by one, and if they are different I want to print out the name of the image file.

    December 22, 2016
    Reply
    • Selva, you would:
      a) use os.listdir() to get the files within a folder
      b) repeat a) for your second folder
      c) sort the lists
      d) then loop – checking for the same filename and then compare.

      It’s more or less what is happening in get_pdf_diff method above. Let us know if you need more help with this.

      December 25, 2016
      Reply
      • selva said:

        Hi ArunKumar,
        Thanks for your help. I will try that. Please keep up your work. Excellent work. you have helped in the past to work with appium.
        Regards
        selva

        December 29, 2016
        Reply
  4. Anonymous said:

    hi Arun,

    I have implemented the code as below and the output I got as below. it doesn’t seem to work as expected. If I do one by one it works as expected.
    when I say it didn’t work as expected i have different image files in the folder and try and compare it always print as same.
    import cv2
    import numpy as np
    import os

    file1= os.listdir(“C:/Program Files (x86)/Python35-32/file1”)
    file2=os.listdir(“C:/Program Files (x86)/Python35-32/file2”)

    for f1 in file1:
    if f1.endswith(“.png”):
    print(“file1”,f1)
    image1 = cv2.imread(f1)
    for f2 in file2:
    if f2.endswith(“.png”):
    print(“file2”,f2)
    image2 = cv2.imread(f2)
    difference = cv2.subtract(image1, image2)
    result = not np.any(difference) #if difference is all zeros it will return False

    if result is True:
    print(“The images are the same”)
    else:
    cv2.imwrite(“result.jpg”, difference)
    print (“the images are different”)

    —–Out put—-
    file1 my_image.png
    file2 my_image.png
    file2 my_image15.png
    file1 my_image15.png
    file2 my_image.png
    file2 my_image15.png

    The images are the same

    January 1, 2017
    Reply
    • Hey Selva,

      Your code has lost its formatting. Can you enclose it between pre tags?

      BTW, I think you need a

      for f1,f2 in zip(file1,file2):
        #Do image compare of f1, f2 here

      But I can tell you that after looking at your formatted code.

      January 2, 2017
      Reply
  5. Anonymous said:

    Hi Arun,
    I have included in the pre TAG. I hope this what you said.
    pre {
    import cv2
    import numpy as np
    import os

    file1= os.listdir(“C:/Program Files (x86)/Python35-32/file1”)
    file2=os.listdir(“C:/Program Files (x86)/Python35-32/file2”)

    for f1 in file1:
    if f1.endswith(“.png”):
    print(“file1”,f1)
    image1 = cv2.imread(f1)
    for f2 in file2:
    if f2.endswith(“.png”):
    print(“file2”,f2)
    image2 = cv2.imread(f2)
    difference = cv2.subtract(image1, image2)
    result = not np.any(difference) #if difference is all zeros it will return False

    if result is True:
    print(“The images are the same”)
    else:
    cv2.imwrite(“result.jpg”, difference)
    print (“the images are different”)

    }

    January 2, 2017
    Reply
    • Selva,

      You need to try this:

      import cv2
      import numpy as np
      import os

      dir1 = "C:/Program Files (x86)/Python35-32/file1"
      dir2 = "C:/Program Files (x86)/Python35-32/file2"

      #sorted() returns a new sorted list; list.sort() sorts in place and returns None
      #... I am assuming you have the same filenames in both directories
      file1 = sorted(os.listdir(dir1))
      file2 = sorted(os.listdir(dir2))

      for f1,f2 in zip(file1,file2):
          if f1.endswith('.png') and f2.endswith('.png'):
              print('file1',f1)
              #os.listdir returns bare file names, so join them with the directory
              image1 = cv2.imread(os.path.join(dir1,f1))
              print('file2',f2)
              image2 = cv2.imread(os.path.join(dir2,f2))
              #absdiff catches changes in either direction; subtract clips negative values to 0
              difference = cv2.absdiff(image1, image2)
              result = not np.any(difference) #True when the difference is all zeros

              if result is True:
                  print('The images are the same')
              else:
                  cv2.imwrite('result.jpg', difference)
                  print ('the images are different')

      January 9, 2017
      Reply
      • selva said:

        Hi Arun,
        Thanks for your help. Sorry for the delayed response.
        Thanks
        selva

        February 11, 2017
        Reply
  6. Gupta said:

    Hello,
    I get ImportError: no module named ‘PythonMagick”. I installed ImageMagick and PythonMagick from the source. How did everyone else install theirs.

    September 15, 2017
    Reply
    • Shivahari P said:

      Hi,
      Please follow step 5 from this post, https://glenbambrick.com/tag/pythonmagick/, to pip install PythonMagick from a local download directory.

      September 19, 2017
      Reply
  7. Philippe ENTZMANN said:

    Is there any GitHub repo for this code?

    February 8, 2018
    Reply
    • Avinash Shetty said:

      No, we don’t have this code on GitHub.

      February 11, 2018
      Reply
  8. Pankaj said:

    Hi Avinash,
    Thanks for this and it really helped me.
    The issue I have is that, when it generates diff_file.pdf, the text on this file is not visible.
    It showing something which can’t be read. It is not showing actual text it is showing zoomed pixels or something like that.
    Can you plz guide me through that how can i make it visible??
    Thanks

    October 18, 2018
    Reply
    • Avinash Shetty said:

      Hi Pankaj,
      I am guessing that the visibility issue is not caused by the overlay of text, since diff_file.pdf is created by writing one page on top of the other.
      Another reason may be a loss of quality when converting the pdf to jpg. For a high-quality pdf to jpg conversion, you can probably use ImageMagick's density option when calling the convert method. Refer to this link

      October 18, 2018
      Reply
      • Pankaj said:

        Thank you Avinash, and it did work for me.
        I’m getting the quality image when it is converted from pdf to image.
        Bu when diff_image is created it is also getting same and I’m not able to find where can I make it visible.
        As ImageChops must be returning an diff_image (so far i know).
        So can you plz suggest the way to make it possible, plz…??

        Thank you

        October 19, 2018
        Reply
        • Indira Nellutla said:

          Hi Pankaj,
          I couldn’t follow your issue completely. If you are referring to diff_image not being clear, ImageChops.difference computes the ‘absolute value of the pixel-by-pixel difference between the two images’, which results in a difference image that is returned. So the diff_image may not be clear in some cases.
          If you are referring to the diff_image directory not being visible, you can probably try passing cleanup flag as false. We have a method to clear the image files created towards the end of the test.

          Reference:- https://stackoverflow.com/questions/32513311/definition-of-imagechops-difference

          October 23, 2018
          Reply
      • Aadee said:

        Hi Avinash ,

        I got a black background pdf. whats wrong ..with my trial? Please help

        February 15, 2019
        Reply
  9. Sandeep Jaju said:

    Hi Avinash,
    I followed your instructions but I was unsuccessful in getting positive result.
    Also I referred: https://glenbambrick.com/tag/pythonmagick/
    System configurations: Windows 10 64 bit
    Installed packages:
    ImageMagick 7.0.8-23 Q8 32 bit
    GhostScript 32 bit
    pythonmagick-0.9.10-cp27-none-win32.whl
    python version: 2.7 32 bit

    I am getting dll import error for pythonMagick.

    Error description:

    import PythonMagick

    Traceback (most recent call last):
    File “”, line 1, in
    import PythonMagick
    File “C:\Python27\lib\site-packages\PythonMagick\__init__.py”, line 1, in
    from . import _PythonMagick
    ImportError: DLL load failed: The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail.

    Please suggest how to resolve this issue.

    January 3, 2019
    Reply
  10. Hamer Basta said:

    hello
    i have question: is it possible to compare more than 2 PDF? And is it possible to automate it for a bunch of PDF’s

    grt,

    January 15, 2019
    Reply
    • Rohan Dudam said:

      Yes, it is possible to compare more than 2 PDFs, but you need to keep one PDF as a base. It is also possible to automate it for a bunch of PDFs.
      Thanks

      January 17, 2019
      Reply
  11. Anonymous said:

    need help.
    unable to generate images.

    C:\workspace\python Prj\Threads>python rd.py --f1 C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf --f2 C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    About to call convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    Finished calling convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    Total of 1 jpgs produced after converting the pdf file: C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    About to call convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    Finished calling convert on C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    Total of 1 jpgs produced after converting the pdf file: C:\Users\mohammad.irfan\Desktop\HelloWorld.pdf
    diff_images directory created
    Total pages in pdf2: 1
    Total pages in pdf1 : 1
    Check SUCCEEDED: There are an equal number of jpgs created from the pdf generated from pdf2 and pdf1
    Total pages in images: 1
    Inside create diff image.. with args : – [‘HelloWorld.jpg’] [‘HelloWorld.jpg’]
    Error when trying to open image
    About to call convert on C:\workspace\python Prj\Threads\diff_images\*.jpg
    convert: unable to open image ‘C:\workspace\python Prj\Threads\diff_images\*.jpg’: Invalid argument @ error/blob.c/OpenBlob/3485.
    convert: no images defined `C:\workspace\python Prj\Threads\diff_HelloWorld.pdf’ @ error/convert.c/ConvertImageCommand/3300.
    Convert exception … could be an ImageMagick bug
    Command ‘[‘convert’, ‘C:\\workspace\\python Prj\\Threads\\diff_images\\*.jpg’, ‘C:\\workspace\\python Prj\\Threads\\diff_HelloWorld.pdf’]’ returned non-zero exit status 1.
    Finished calling convert on C:\workspace\python Prj\Threads\diff_images\*.jpg
    Cleaning up all the intermediate jpg files created when comparing the pdf
    Unable to delete jpg file
    [WinError 2] The system cannot find the file specified: ‘C:\\workspace\\python Prj\\Threads\\HelloWorld.jpg’
    Nuking the temporary image_diff directory
    The PDFs didnt match properly, check the diff file generated

    February 15, 2019
    Reply
  12. Mahesh yadav said:

    Hi All,

    Thanks for everything,
    I’m facing the issues, please help me on this.
    Everything is working fine . But diff PDF contains all pages as black.
    Can anyone please help on this issue.

    Thanks & Regrads,
    Mahesh

    June 25, 2019
    Reply
    • Indira Nellutla said:

      Hi Mahesh,

      Could you please provide more details about the issue?

      Thanks
      Indira Nellutla

      June 25, 2019
      Reply
  13. Kiran said:

    Hi,
    Program works very nice. Thanks for your efforts.
    I am getting a black image as a result by leaving the difference in content untouched.
    Instead, how can I get the result pdf by highlighting the difference inside a rectangle by retaining the other content of the pdf as it is.

    Please suggest !!

    September 17, 2019
    Reply
  14. Kiran said:

    Currently, the result is generation of a pdf file with black images. Instead of that, can we get a result pdf by highlighting the content difference inside a rectangle. Please suggest !!

    September 17, 2019
    Reply
      • Kiran said:

        Thank you Smitha,
        I am always getting the exception as “Error when trying to open image” on execution of the function
        diff = ImageChops.difference(pdf2_image,pdf1_image) .

        ####################
        pdf2_image = Image.open(download_dir +os.sep+ pdf2_img)
        pdf1_image = Image.open(download_dir+os.sep + pdf1_img)
        diff = ImageChops.difference(pdf2_image,pdf1_image)
        ####################
        Separate images have been created from every page of the pdf document.

        I also tried using Open CV.
        pdf1_image = cv2.imread(download_dir+os.sep + pdf1_img)
        pdf2_image = cv2.imread(download_dir +os.sep+ pdf2_img)
        pdf1_image = cv2.cvtColor(pdf1_image, cv2.COLOR_BGR2GRAY)
        pdf2_image = cv2.cvtColor(pdf2_image, cv2.COLOR_BGR2GRAY)

        cv2.imshow(“pdf1_image”, pdf1_image)
        cv2.imshow(“pdf2_image”, pdf2_image)
        cv2.waitKey(0)
        -> Able to open the image too.
        but the function (score,diff) = compare_ssim(pdf1_image, pdf2_image, full = True) doesnt work.
        Please advise.

        regards,
        Kiran

        September 21, 2019
        Reply
        • Shivahari P said:

          Hi,
          The 'Error when trying to open image' message is very generic. Try printing the actual error in the except block of the create_diff_image method by using print(str(e)) instead of print('Error when trying to open image').
          Also, what does the function compare_ssim do?

          September 26, 2019
          Reply
  15. Richard Carmona said:

    Hello I am trying to run this but my current issue is that PythonMagick is included as a package in the source code but I am unable to install it, I am using Python 3.7, does that matter? Thank you.

    I have also installed ImageMagick and GhostScript

    January 31, 2020
    Reply
      • Swati said:

        Hi Indira,
        I have used the PythonMagic.whl from the url given by you. I have installed wheel via pip and when I am installing PythonMagic via pip by going on the path where PythonMagic.whl files is downloaded. In CLI , it is giving me below error:
        ERROR: PythonMagick-0.9.19-cp38-cp38-win32.whl is not a supported wheel on this platform.

        July 23, 2020
        Reply
  16. Ramesh said:

    When i am trying to execute i am getting the below error. Can you please help me solve it.

    python pdf_files_comparisions.py --f1 "U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf" --f2 "U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf"
    About to call convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf
    Invalid Parameter – U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.jpg
    Convert exception … could be an ImageMagick bug
    Finished calling convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Pre.pdf
    Total of 0 jpgs produced after converting the pdf file: U:\Datagaps\image_Differ
    ences\pdfs\Trellis1_Pre.pdf
    About to call convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
    Invalid Parameter – U:\Datagaps\image_Differences\pdfs\Trellis1_Post.jpg
    Convert exception … could be an ImageMagick bug
    Finished calling convert on U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf

    Total of 0 jpgs produced after converting the pdf file: U:\Datagaps\image_Differ
    ences\pdfs\Trellis1_Post.pdf
    diff_images directory created
    Total pages in pdf2: 0
    Total pages in pdf1 : 0
    Check FAILED: There are an unequal number of jpgs created from the pdf generated
    from pdf2 and pdf1
    Total pages in image2 : 0
    Total pages in image1: 0
    ERROR: Skipping image comparison between U:\Datagaps\image_Differences\pdfs\Trel
    lis1_Pre.pdf and U:\Datagaps\image_Differences\pdfs\Trellis1_Post.pdf
    Cleaning up all the intermediate jpg files created when comparing the pdf
    Nuking the temporary image_diff directory
    Traceback (most recent call last):
    File “pdf_files_comparisions.py”, line 150, in
    result_flag = test_obj.get_pdf_diff()
    File “pdf_files_comparisions.py”, line 138, in get_pdf_diff
    return result_flag
    UnboundLocalError: local variable ‘result_flag’ referenced before assignment.

    September 24, 2020
    Reply
    • Rahul Bhave said:

      Hi Ramesh,
      I think the problem here is that ImageMagick failed to convert the PDF into JPEG files. You can see this in the following line from the error log you posted:
      `Convert exception … could be an ImageMagick bug`
      Once this issue is resolved, you should be able to proceed with the code. I would suggest checking the ImageMagick site for similar problems and their solutions. One such thread I noticed is:
      `www.imagemagick.org/discourse-server/viewtopic.php?t=35171`
      You can also try a different ImageMagick version to check whether the problem persists there as well.

      Regards,
      Rahul

      September 25, 2020
      Reply

Leave a Reply

Your email address will not be published.