Compare JSON objects in an AWS S3 bucket using deepdiff

Recently, I got a chance to work with an AWS S3 bucket, where I compared JSON files stored in the bucket against a pre-defined data structure held as a dictionary object, using deepdiff. I can’t replicate the entire system I tested, so for this blog I have come up with the following prerequisites, setup and flow:


Prerequisites:

1. An AWS login is required. Store the details AWS_ACCOUNT_ID, AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the aws_configuration_conf.py file. These details are user-specific.

2. Create an S3 bucket named compare-json.

3. Keep the JSON file sample.json in the S3 bucket.

4. Keep expected_message.json in the samples folder; sample.json will be compared against it.
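
For reference, here is a minimal sketch of what aws_configuration_conf.py might contain. The script in this blog imports it as conf.aws_configuration_conf, so it is assumed to live in a conf folder. Every value below is a placeholder rather than a real credential and should be replaced with your own details.

```python
# conf/aws_configuration_conf.py (location assumed from the import in the script)
# All values are placeholders; substitute your own account details.
AWS_ACCOUNT_ID = "123456789012"
AWS_DEFAULT_REGION = "us-east-1"
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
```

Since this file holds secrets, it is best kept out of version control.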


Summary:

In the following sections, I will walk through these steps:

1. Create an S3 bucket.

2. Create sample.json in the S3 bucket; it is referred to as the Key in this blog.

3. Store expected_message.json in the samples template directory.

4. Execute the Python script s3_compare_json.py. I have kept the source code here.


Steps:

1. Created an S3 bucket in AWS. The article here will help you create an S3 bucket in AWS.

2. Created sample.json in the S3 bucket; it is referred to as the Key in this blog. My sample.json looks like this:

{
   "PROFESSIONAL PLAYER":true,
   "NAME":"NADAL",
   "MATCHES PLAYED":750,
   "MATCHES WON":577,
   "MATCHES LOST":123,
   "STATUS":"ACTIVE",
   "COUNTRY":"ESP",
   "TURNED PRO":"2000-02-29",
   "PRICE MONEY":{
      "AMOUNT":8900005,
      "CURRENCY":"USD"
   },
   "ENDORSEMENT FEE":{
      "AMOUNT":400000,
      "CURRENCY":"INR"
   }
}

3. Used the following expected_message.json:

{
   "PROFESSIONAL PLAYER":true,
   "NAME":"ABCDEF",
   "MATCHES PLAYED":500,
   "MATCHES WON":400,
   "MATCHES LOST":100,
   "STATUS":"ACTIVE",
   "COUNTRY":"IND",
   "TURNED PRO":"9999-12-31",
   "PRICE MONEY":{
      "AMOUNT":1000000,
      "CURRENCY":"USD"
   },
   "ENDORSEMENT FEE":{
      "AMOUNT":500000,
      "CURRENCY":"USD"
   }
}

Note that some values differ between the two JSON documents. I kept these differences on purpose to demo the sample code.
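
To make the differences concrete before involving S3 or deepdiff, here is a small hand-rolled sketch that walks both documents and prints the paths whose values differ. It is not deepdiff and not the script from this blog; the changed_paths helper and its slash-separated path format are made up for illustration.

```python
import json

sample = json.loads("""
{"PROFESSIONAL PLAYER": true, "NAME": "NADAL", "MATCHES PLAYED": 750,
 "MATCHES WON": 577, "MATCHES LOST": 123, "STATUS": "ACTIVE",
 "COUNTRY": "ESP", "TURNED PRO": "2000-02-29",
 "PRICE MONEY": {"AMOUNT": 8900005, "CURRENCY": "USD"},
 "ENDORSEMENT FEE": {"AMOUNT": 400000, "CURRENCY": "INR"}}
""")

expected = json.loads("""
{"PROFESSIONAL PLAYER": true, "NAME": "ABCDEF", "MATCHES PLAYED": 500,
 "MATCHES WON": 400, "MATCHES LOST": 100, "STATUS": "ACTIVE",
 "COUNTRY": "IND", "TURNED PRO": "9999-12-31",
 "PRICE MONEY": {"AMOUNT": 1000000, "CURRENCY": "USD"},
 "ENDORSEMENT FEE": {"AMOUNT": 500000, "CURRENCY": "USD"}}
""")


def changed_paths(expected, actual, prefix=""):
    """Yield slash-separated paths whose values differ (hypothetical helper)."""
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}/{key}"
        # A key missing on one side is treated as None in this sketch
        old, new = expected.get(key), actual.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested objects such as PRICE MONEY
            yield from changed_paths(old, new, path)
        elif old != new:
            yield path


for path in changed_paths(expected, sample):
    print(path)
```

This lists nine differing paths. NAME and TURNED PRO are among them, while STATUS and PROFESSIONAL PLAYER are not, since they match in both documents.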

4. Wrote the following Python script, s3_compare_json.py, to compare the Key with the expected JSON. The compare_dict method compares the dictionary objects created from sample.json and expected_message.json, and DeepDiff is used to find the differences between the two dictionary objects.

"""
This file contains the following method and class:
1. compare_dict method.
2. s3utilities class, which has the following methods:
2.a. Get the response from the S3 client.
2.b. Convert the response into a dict object.
2.c. Get the response dict object.
2.d. Get the expected dict from the JSON stored as the expected message.
"""
import boto3
import deepdiff
import json
import logging
import os
import re
import sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import conf.aws_configuration_conf as aws_conf
from pythonjsonlogger import jsonlogger
from pprint import pprint
 
# logging
log_handler = logging.StreamHandler()
log_handler.setFormatter(jsonlogger.JsonFormatter())
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(log_handler)
 
# Export the AWS settings as environment variables; boto3 reads the region and credentials from them
os.environ['AWS_ACCOUNT_ID'] = aws_conf.AWS_ACCOUNT_ID
os.environ['AWS_DEFAULT_REGION'] = aws_conf.AWS_DEFAULT_REGION
os.environ['AWS_ACCESS_KEY_ID'] = aws_conf.AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = aws_conf.AWS_SECRET_ACCESS_KEY
 
# Compare two dicts, ignoring the keys named in exclude_paths
def compare_dict(response_dict, expected_dict):
    # Skip any path that contains the 'TURNED PRO' or 'NAME' key
    exclude_paths = re.compile(r"\'TURNED PRO\'|\'NAME\'")
    diff = deepdiff.DeepDiff(expected_dict, response_dict,\
        exclude_regex_paths=[exclude_paths],verbose_level=0)
 
    return diff
 
# class to write s3 utilities
class s3utilities():
    logger = logging.getLogger(__name__)
 
    def __init__(self, s3_bucket, key, template_directory):
        # initialising the class
        self.logger.info('s3 utilities activated')
        self.s3_bucket = s3_bucket
        self.key = key
        self.template_directory = template_directory
        self.s3_client = boto3.client('s3')
 
    def get_response(self, bucket, key):
        # Get Response s3 client object
        response = self.s3_client.get_object(Bucket=bucket, Key=key)
 
        return response
 
    def convert_dict_from_response(self,response):
        # Convert response into dict object
        response_json = ""
        for line in response["Body"].iter_lines():
            response_json += line.decode("utf-8")
        response_dict = json.loads(response_json)
 
        return response_dict
 
    def get_response_dict(self):
        # Get Response dict object
        response = self.get_response(self.s3_bucket,self.key)
        response_dict = self.convert_dict_from_response(response)
 
        return response_dict
 
    def get_expected_dict(self):
        # Get expected dict from json stored as expected json
        current_directory = os.path.dirname(os.path.realpath(__file__))
        message_template = os.path.join(current_directory,\
            self.template_directory,'expected_message.json')
        with open(message_template,'r') as fp:
            expected_dict = json.loads(fp.read())
 
        return expected_dict
 
if __name__ == "__main__":
    # Testing s3utilities
    s3_bucket = "compare-json"
    key = 'sample.json'
    template_directory = 'samples'
    s3utilities_obj = s3utilities(s3_bucket, key, template_directory)
    response_dict = s3utilities_obj.get_response_dict()
    expected_dict = s3utilities_obj.get_expected_dict()
    diff = compare_dict(response_dict, expected_dict)
    print("=========================================================")
    print('Actual difference between two jsons is:')
    pprint(diff)
    print("=========================================================")
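
The body-parsing step can be tried without AWS access. In this minimal sketch, io.BytesIO stands in for the file-like body that get_object returns under the Body key, and the whole body is read in one call rather than joined line by line. The fake_response dict and the body_to_dict helper are hypothetical and exist only for this illustration.

```python
import io
import json

# Hypothetical stand-in for the dict returned by s3_client.get_object();
# io.BytesIO mimics the file-like "Body", so no AWS call is needed.
fake_response = {
    "Body": io.BytesIO(b'{\n  "NAME": "NADAL",\n  "MATCHES WON": 577\n}')
}


def body_to_dict(response):
    """Read the whole body and parse it as JSON."""
    raw = response["Body"].read()  # bytes
    return json.loads(raw.decode("utf-8"))


response_dict = body_to_dict(fake_response)
print(response_dict)
```

Reading the body once keeps the parsing in a single step; for very large objects, a streaming approach may be a better fit.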

When I ran the script with the command python s3_compare_json.py, the values that differ between the expected JSON and the sample JSON were shown on the console. Note that TURNED PRO and NAME also differ between the two JSON documents, but they are filtered out of the result because they are excluded in the following code snippet:

exclude_paths = re.compile(r"\'TURNED PRO\'|\'NAME\'")
diff = deepdiff.DeepDiff(expected_dict, response_dict,\
     exclude_regex_paths=[exclude_paths],verbose_level=0)
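
To see why this pattern filters those two keys, it helps to look at the path strings involved. deepdiff reports dictionary items with paths of the form root['KEY']. The sketch below assumes that each compiled pattern in exclude_regex_paths is searched against such a path. It uses only the re module and hand-written example paths (not output produced by deepdiff) to show which ones the pattern would match.

```python
import re

exclude_paths = re.compile(r"\'TURNED PRO\'|\'NAME\'")

# Hand-written paths in the root['KEY'] style used for dict items
candidate_paths = [
    "root['NAME']",
    "root['TURNED PRO']",
    "root['COUNTRY']",
    "root['PRICE MONEY']['AMOUNT']",
]

for path in candidate_paths:
    # search() matches anywhere in the path, not only at the start
    excluded = exclude_paths.search(path) is not None
    print(f"{path}: {'excluded' if excluded else 'compared'}")
```

Because the quotes are part of the pattern, a key such as 'FIRST NAME' would not match. The pattern is not anchored, though, so a nested key called 'NAME' at any depth would be excluded as well.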


I hope you liked the blog. The source code is available here. You can find some useful documentation about deepdiff here.

