{"id":3604,"date":"2016-01-29T04:26:19","date_gmt":"2016-01-29T09:26:19","guid":{"rendered":"http:\/\/qxf2.com\/blog\/?p=3604"},"modified":"2024-10-22T02:46:22","modified_gmt":"2024-10-22T06:46:22","slug":"compare-pdfs-python","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/compare-pdfs-python\/","title":{"rendered":"How to compare PDFs using Python"},"content":{"rendered":"<p><strong>Problem:<\/strong> How do you compare two PDF files programmatically using Python?<\/p>\n<p>Adobe makes it <a href=\"http:\/\/blogs.adobe.com\/acrolaw\/2007\/03\/comparing_two_p_1\/\">easy to compare<\/a> the changes in two PDF files. However, as testers, we sometimes need to compare a lot of PDF files (especially reports!) against some preset baselines. In these cases, it helps to have a script that can compare PDF files and tell you if they differ in any way.<\/p>\n<p>There are several options. We like <a href=\"http:\/\/www.qtrac.eu\/diffpdf.html\">DiffPDF<\/a>, <a href=\"http:\/\/www.pdf2text.com\/\">pdf2text<\/a>, and the pdf-diff Python module. Each option comes with its own set of pros and cons. Most of these solutions do a good job of comparing the text in the PDF files. However, we noticed that they are somewhat lacking when it comes to comparing graphs and charts. In this post, we show you one more approach that is useful if you have a lot of graphs and charts in your PDF files.<\/p>\n<p><strong>WARNING:<\/strong> You should be using this kind of automated check as a last resort. Ask your developers for other ways to check the data\/content of the PDF files before using this approach.<\/p>\n<hr>\n<h3>Steps involved<\/h3>\n<p>We will be using image comparison to verify whether the two PDF files are identical. To do so, we need to:<br \/>\n1. Get set up with ImageMagick and Ghostscript<br \/>\n2. Convert each page of the PDF file into one image<br \/>\n3. Compare corresponding images and save the resulting difference image for every page<br \/>\n4. 
Stitch all the resulting difference images into a single PDF file<br \/>\n5. Use the utility to compare two PDF files<\/p>\n<p>I have created a class, <em>PDF_Image_Compare<\/em>, which can be used to compare two PDFs. The class will help you compare two PDF files, list the pages that differ, and give you overlaid images of the two PDF files. The steps below explain the methods and modules required to compare two PDF files.<\/p>\n<hr>\n<p><strong>Step 1. Get set up with ImageMagick and Ghostscript<\/strong><br \/>\nThe first step is to convert the PDF file to a different format like jpg. We will use ImageMagick, which in turn uses Ghostscript. To do this you need to:<br \/>\na. Download and install <a href=\"http:\/\/www.imagemagick.org\/script\/index.php\">ImageMagick<\/a>, which is a software suite to create, edit, compose, or convert bitmap images<br \/>\nb. ImageMagick needs <a href=\"http:\/\/www.ghostscript.com\/\">Ghostscript<\/a>, which is an interpreter for the PostScript language and for PDF.<br \/>\nc. Add both ImageMagick and Ghostscript to your PATH environment variable.<br \/>\nd. Verify you are set up correctly by using the &#8220;convert&#8221; utility. Open a command prompt and run the command &#8216;convert file.pdf file.jpg&#8217; to convert file.pdf into file.jpg.<\/p>\n<hr>\n<p><strong>Step 2. Convert each page of the PDF file into one image<\/strong><br \/>\nWe plan to use the <strong>difference<\/strong> method in the <a href=\"http:\/\/effbot.org\/imagingbook\/imagechops.htm\">ImageChops<\/a> module, which returns the absolute value of the pixel-by-pixel difference between the two images. However, we can&#8217;t use it directly on a PDF file. So we first have to convert the PDF into a list of images. To do so, we will call <strong>convert<\/strong> from Python using the <strong>subprocess<\/strong> module.  
<\/p>\n<pre lang=\"python\">\r\n    def get_image_list_from_pdf(self,pdf_file):\r\n        \"Return a list of images that resulted from running convert on a given pdf\"\r\n        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]\r\n        pdf_dir = pdf_file.split(pdf_name)[0]\r\n        jpg = pdf_file.split('.pdf')[0]+'.jpg'\r\n        # Convert the pdf file to jpg file\r\n        self.call_convert(pdf_file,jpg)\r\n        #Get all the jpg files after calling convert and store it in a list\r\n        image_list = []        \r\n        file_list = os.listdir(pdf_dir)\r\n        for f in file_list:\r\n            if f[-4:]=='.jpg' and pdf_name in f:\r\n                #Matches by substring, so make sure the file names of both pdf are not similar\r\n                image_list.append(f)\r\n\r\n        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))\r\n        return image_list\r\n        \r\n    \r\n    def call_convert(self,src,dest):\r\n        \"Call convert to convert pdf to jpg\"\r\n        print('About to call convert on %s'%src)\r\n        try:\r\n            subprocess.check_call([\"convert\",src,dest], shell=True)\r\n        except Exception as e:\r\n            print('Convert exception ... could be an ImageMagick bug')\r\n            print(e)\r\n        print('Finished calling convert on %s'%src)\r\n<\/pre>\n<hr>\n<p><strong>Step 3. Compare corresponding images and save the resulting difference image for every page<\/strong><br \/>\nNow we can use the <em>ImageChops.difference()<\/em> method to compare the images from the list of images created.<br \/>\nThe method below does this. It returns result_flag as True if the images match and False if they do not. 
The method will also print out the image pairs that differ.<\/p>\n<pre lang=\"python\">\r\n    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):\r\n        \"Creates the diffed images in diff image directory and generates a pdf by calling call convert\"\r\n        result_flag = True\r\n        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):\r\n            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img\r\n            try:\r\n                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)\r\n                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)\r\n                diff = ImageChops.difference(pdf2_image,pdf1_image)\r\n                diff.save(diff_filename)\r\n\r\n                #getbbox() returns None when the difference image is completely black, i.e. the two pages match\r\n                if diff.getbbox() is not None:\r\n                    result_flag = False\r\n                    print ('The file didnt match for: \\n>>%s\\nand\\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))\r\n            except Exception as e:\r\n                print('Error when trying to open image')\r\n                print(e)\r\n                result_flag = False\r\n\r\n        return result_flag\r\n<\/pre>\n<hr>\n<p><strong>Step 4. Stitch all the resulting difference images into a single PDF file <\/strong><br \/>\nWe found it useful to present the final difference as one PDF file rather than as a series of images. This makes it easier for the human interpreting the results to quickly identify and summarize the differences. 
To do so, add the code below to the create_diff_image method above, after the loop and before the return statement.<\/p>\n<pre lang=\"python\">\r\n        #Create a pdf out of all the jpgs created\r\n        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'\r\n        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)\r\n\r\n        if os.path.exists(diff_pdf_name):\r\n            print('Successfully created the difference pdf: %s'%(diff_pdf_name))\r\n<\/pre>\n<hr>\n<p><strong>Step 5. Use the utility to compare two PDF files<\/strong><\/p>\n<pre lang=\"python\">\r\n\r\nif __name__== '__main__':\r\n    #Lets accept command line options for the location of two PDF files from the user \r\n    #We have chosen to use the Python module optparse \r\n    usage = \"usage: %prog --f1 <pdf1> --f2 <pdf2>\\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\\n---\"\r\n    parser = OptionParser(usage=usage)\r\n    parser.add_option(\"--f1\",\"--pdf1\",dest=\"pdf1\",help=\"The location of pdf file1\",default=None)\r\n    parser.add_option(\"--f2\",\"--pdf2\",dest=\"pdf2\",help=\"The location of pdf file2\",default=None)\r\n    (options,args) = parser.parse_args()\r\n    \r\n    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)\r\n    result_flag = test_obj.get_pdf_diff()\r\n    if result_flag:\r\n        print ('The two PDFs matched properly')\r\n    else:\r\n        print ('The PDFs did not match properly, check the diff file generated')\r\n<\/pre>\n<hr>\n<h3>Putting it all together<\/h3>\n<p>Here is how our utility to compare two PDFs looks.<\/p>\n<pre lang=\"python\">\r\nfrom PIL import Image, ImageChops\r\nimport os,time,subprocess,shutil\r\nfrom optparse import OptionParser\r\n\r\nclass PDF_Image_Compare:    \r\n    \"Compares two pdf files\"\r\n    def __init__(self,pdf1,pdf2):\r\n        \"Constructor: Initialises file1 and file 2\"\r\n        self.download_dir = os.getcwd()\r\n        self.pdf1 = pdf1\r\n     
   self.pdf2 = pdf2\r\n\r\n\r\n    def get_image_list_from_pdf(self,pdf_file):\r\n        \"Return a list of images that resulted from running convert on a given pdf\"\r\n        pdf_name = pdf_file.split(os.sep)[-1].split('.pdf')[0]\r\n        pdf_dir = pdf_file.split(pdf_name)[0]\r\n        jpg = pdf_file.split('.pdf')[0]+'.jpg'\r\n        # Convert the pdf file to jpg file\r\n        self.call_convert(pdf_file,jpg)\r\n        #Get all the jpg files after calling convert and store it in a list\r\n        image_list = []        \r\n        file_list = os.listdir(pdf_dir)\r\n        for f in file_list:\r\n            if f[-4:]=='.jpg' and pdf_name in f:\r\n                #Matches by substring, so make sure the file names of both pdf are not similar\r\n                image_list.append(f)\r\n\r\n        print('Total of %d jpgs produced after converting the pdf file: %s'%(len(image_list),pdf_file))\r\n        return image_list\r\n        \r\n    \r\n    def call_convert(self,src,dest):\r\n        \"Call convert to convert pdf to jpg\"\r\n        print('About to call convert on %s'%src)\r\n        try:\r\n            subprocess.check_call([\"convert\",src,dest], shell=True)\r\n        except Exception as e:\r\n            print('Convert exception ... 
could be an ImageMagick bug')\r\n            print(e)\r\n        print('Finished calling convert on %s'%src)\r\n\r\n\r\n    def create_diff_image(self,pdf1_list,pdf2_list,diff_image_dir):\r\n        \"Creates the diffed images in diff image directory and generates a pdf by calling call convert\"\r\n        result_flag = True\r\n        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):\r\n            diff_filename = diff_image_dir + os.sep+'diff_' + pdf2_img\r\n            try:\r\n                pdf2_image = Image.open(self.download_dir +os.sep+ pdf2_img)\r\n                pdf1_image = Image.open(self.download_dir+os.sep + pdf1_img)\r\n                diff = ImageChops.difference(pdf2_image,pdf1_image)\r\n                diff.save(diff_filename)\r\n\r\n                #getbbox() returns None when the difference image is completely black, i.e. the two pages match\r\n                if diff.getbbox() is not None:\r\n                    result_flag = False\r\n                    print ('The file didnt match for: \\n>>%s\\nand\\n>>%s'%(self.download_dir +os.sep+ pdf2_img,self.download_dir +os.sep+ pdf1_img))\r\n            except Exception as e:\r\n                print('Error when trying to open image')\r\n                print(e)\r\n                result_flag = False\r\n        #Create a pdf out of all the jpgs created\r\n        diff_pdf_name = 'diff_'+pdf2_img.split('.jpg')[0]+'.pdf'\r\n        self.call_convert(diff_image_dir+os.sep+'*.jpg', self.download_dir+os.sep+diff_pdf_name)\r\n\r\n        if os.path.exists(diff_pdf_name):\r\n            print('Successfully created the difference pdf: %s'%(diff_pdf_name))\r\n\r\n        return result_flag\r\n    \r\n\r\n    def cleanup(self,diff_image_dir,pdf1_list,pdf2_list):\r\n        \"Clean up all the image files created\"\r\n        print('Cleaning up all the intermediate jpg files created when comparing the pdf')\r\n        for pdf1_img,pdf2_img in zip(pdf1_list,pdf2_list):\r\n            
try:\r\n                os.remove(self.download_dir +os.sep+ pdf1_img)\r\n                os.remove(self.download_dir +os.sep+ pdf2_img)\r\n            except Exception as e:\r\n                print('Unable to delete jpg file')\r\n                print(e)\r\n        print('Nuking the temporary diff_images directory')\r\n        try:\r\n            time.sleep(5)\r\n            shutil.rmtree(diff_image_dir)\r\n        except Exception as e:\r\n            print('Could not delete the diff_images directory')\r\n            print(e)\r\n                \r\n\r\n    def get_pdf_diff(self,cleanup=True):\r\n        \"Create a difference pdf by overlaying the two pdfs and generating an image difference. Returns True if the files match, else returns False\"\r\n\r\n        # Initialize result_flag to False\r\n        result_flag = False\r\n\r\n        #Get the list of images using get_image_list_from_pdf which in turn calls convert on a given pdf  \r\n        pdf1_list = self.get_image_list_from_pdf(self.pdf1)\r\n        pdf2_list = self.get_image_list_from_pdf(self.pdf2)\r\n        \r\n        #If the diff directory already exists - delete it \r\n        #Easier to simply nuke the folder and create it again than to check if it's empty\r\n        diff_image_dir = self.download_dir + os.sep+'diff_images'\r\n        if os.path.exists(diff_image_dir):\r\n            print('diff_images directory exists ... 
about to nuke it')\r\n            shutil.rmtree(diff_image_dir)\r\n\r\n        #Create a new and empty diff directory\r\n        os.mkdir(diff_image_dir)\r\n        print('diff_images directory created')\r\n        print('Total pages in pdf2: %d'%len(pdf2_list))\r\n        print('Total pages in pdf1 : %d'%len(pdf1_list))\r\n\r\n        #Verify that there are equal number pages in pdf1 and pdf2\r\n        if len(pdf2_list)==len(pdf1_list) and len(pdf2_list) !=0:\r\n            print('Check SUCCEEDED: There are an equal number of jpgs created from the pdf generated from pdf2 and pdf1')\r\n            print('Total pages in images: %d'%len(pdf2_list))\r\n            pdf1_list.sort()\r\n            pdf2_list.sort()\r\n\r\n            #Create the diffed images\r\n            result_flag = self.create_diff_image(pdf1_list,pdf2_list,diff_image_dir)\r\n        else:\r\n            print('Check FAILED: There are an unequal number of jpgs created from the pdf generated from pdf2 and pdf1')\r\n            print('Total pages in image2 : %d'%len(pdf2_list))\r\n            print('Total pages in image1: %d'%len(pdf1_list))\r\n            print('ERROR: Skipping image comparison between %s and %s'%(self.pdf1,self.pdf2))\r\n\r\n        if cleanup:\r\n            #Delete all the image files created\r\n            self.cleanup(diff_image_dir,pdf1_list,pdf2_list)            \r\n\r\n        return result_flag\r\n\r\nif __name__== '__main__':\r\n    #Lets accept command line options for the location of two PDF files from the user \r\n    #We have chosen to use the Python module optparse \r\n    usage = \"usage: %prog --f1 <pdf1> --f2 <pdf2>\\nE.g.: %prog --f1 'D:\\Image Compare\\Sample.pdf' --f2 'D:\\Image Compare\\Test.pdf'\\n---\"\r\n    parser = OptionParser(usage=usage)\r\n    parser.add_option(\"--f1\",\"--pdf1\",dest=\"pdf1\",help=\"The location of pdf file1\",default=None)\r\n    parser.add_option(\"--f2\",\"--pdf2\",dest=\"pdf2\",help=\"The location of pdf 
file2\",default=None)\r\n    (options,args) = parser.parse_args()\r\n    \r\n    test_obj = PDF_Image_Compare(pdf1=options.pdf1,pdf2=options.pdf2)\r\n    result_flag = test_obj.get_pdf_diff()\r\n    if result_flag:\r\n        print ('The two PDFs matched properly')\r\n    else:\r\n        print ('The PDFs did not match properly, check the diff file generated')\r\n<\/pre>\n<hr>\n<h3>Run the utility file using command prompt<\/h3>\n<p>You can use the utility file any way you want. The screenshot below shows how to compare two PDF files from the command prompt by passing the locations of the PDF files.<\/p>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2016\/01\/Running-the-test.jpg\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2016\/01\/Running-the-test-300x46.jpg\" alt=\"Running the Test\" width=\"300\" height=\"46\" class=\"aligncenter size-medium wp-image-3628\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2016\/01\/Running-the-test-300x46.jpg 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2016\/01\/Running-the-test.jpg 672w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<hr>\n<p>We have used this utility with a couple of our clients. We feel it fills a specific need (comparing image-heavy PDF files) pretty well. We hope this post helps you do the same.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Problem:How do you compare two PDF files programmatically using Python? Adobe makes it easy to compare the changes in two PDF files. However as testers, we sometimes need to compare a lot of PDF files (especially reports!) against some preset baselines. 
In these cases, it helps to have a script that can compare PDF files and tell you if they [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18],"tags":[],"class_list":["post-3604","post","type-post","status-publish","format-standard","hentry","category-python"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/3604","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=3604"}],"version-history":[{"count":32,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/3604\/revisions"}],"predecessor-version":[{"id":22957,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/3604\/revisions\/22957"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=3604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=3604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=3604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}