{"id":17608,"date":"2023-03-06T09:00:47","date_gmt":"2023-03-06T14:00:47","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=17608"},"modified":"2023-03-06T09:00:47","modified_gmt":"2023-03-06T14:00:47","slug":"setting-up-synthetic-data-in-neo4j","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/setting-up-synthetic-data-in-neo4j\/","title":{"rendered":"Setting up Synthetic data in Neo4j"},"content":{"rendered":"<p><a href=\"https:\/\/qxf2.com\/?utm_source=neo4jSynthetic&#038;utm_medium=click&#038;utm_campaign=From%20blog\">Qxf2<\/a> engineers are fans of using synthetic data for testing. We have used this technique for years now. But until recently, our experience was limited to SQL. We needed to work with Neo4j in one of our projects. We had to write enough custom Python code and Cypher queries for creating, backing up and restoring data that we thought of sharing some of it here. <\/p>\n<p>PS: To help you play along, we are going to implement some seed data for one of our internal projects. The code is open sourced and you can find it in the &#8220;Further Reading&#8221; section. <\/p>\n<p>These are some principles we use when we design synthetic data:<\/p>\n<p>1. <strong>Understand the Data Model:<\/strong> It is important to have a clear understanding of the data model before creating synthetic data. This allows for the creation of relevant and meaningful synthetic data, which can help identify potential issues.<\/p>\n<p>2. <strong>Easy Interpretation:<\/strong> Synthetic data should be designed in such a way that its purpose can be easily grasped by a viewer at a glance. This implies that the data must be easy to comprehend and interpret.<\/p>\n<p>3. <strong>Vary the Data:<\/strong> It is important to vary the synthetic data to test different scenarios. This includes varying the data types, values, and relationships to ensure that the database can handle a wide range of data.<\/p>\n<p>4. 
<strong>Make the data extensible:<\/strong> When we create synthetic data for the first time in a project, it is important to design it in such a way that the data is easily extensible. We want to be able to add new data in the future without affecting any of the existing tests. <\/p>\n<h3> Understanding the data model<\/h3>\n<p>Before creating the synthetic data, it&#8217;s important to have a clear understanding of the data model. For this example, we&#8217;ll be creating synthetic data for the <a href=\"https:\/\/github.com\/qxf2\/qxf2-survey\" rel=\"noopener\" target=\"_blank\">Qxf2-survey<\/a> app, which uses the Neo4j database. The following ER diagram represents the data model we&#8217;ll be working with:<br \/>\n<figure id=\"attachment_17776\" aria-describedby=\"caption-attachment-17776\" style=\"width: 600px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_ER-4.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_ER-4.png\" alt=\"Neo4j database model diagram\" width=\"600\" height=\"553\" class=\"size-full wp-image-17776\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_ER-4.png 600w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_ER-4-300x277.png 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-17776\" class=\"wp-caption-text\">Entity Relationship Diagram of Neo4j Database model for Qxf2-Survey App<\/figcaption><\/figure><\/p>\n<p>1. As you can see from the above diagram, the Neo4j data model consists of two node labels: &#8220;<strong>Employees<\/strong>&#8221; and &#8220;<strong>Technology<\/strong>&#8221;. 
The &#8220;<strong>Employees<\/strong>&#8221; label represents individuals in the organization and has properties such as <code>author_name<\/code>, <code>firstName<\/code>, <code>lastName<\/code>, <code>fullName<\/code>, <code>email<\/code>, <code>ID<\/code>, and <code>status<\/code>. The &#8220;<strong>Technology<\/strong>&#8221; label represents the various technologies used in the organization and has properties such as <code>technology_name<\/code> and <code>first_seen<\/code>, the latter stored as a date string.<\/p>\n<p>2. There is also a relationship between the &#8220;<strong>Employees<\/strong>&#8221; and &#8220;<strong>Technology<\/strong>&#8221; nodes called &#8220;<strong>KNOWS<\/strong>&#8221;, which denotes the knowledge that an employee has about a technology. The &#8220;KNOWS&#8221; relationship has one property named &#8220;<code>learnt_dates<\/code>&#8221;, which stores the dates on which the employee learned about the technology.<\/p>\n<p>3. In addition to the &#8220;<strong>KNOWS<\/strong>&#8221; relationship, there are two relationships among the &#8220;<strong>Employees<\/strong>&#8221; nodes: &#8220;<strong>GIVEN<\/strong>&#8221; and &#8220;<strong>TAKEN<\/strong>&#8221;. The &#8220;<strong>GIVEN<\/strong>&#8221; relationship represents the help given by an employee to another and has a property named &#8220;<code>helpgiven<\/code>&#8221;, which stores the dates on which the help was given. The &#8220;<strong>TAKEN<\/strong>&#8221; relationship represents the help taken by an employee from another and has a property named &#8220;<code>helptaken<\/code>&#8221;, which stores the dates on which the help was taken.<\/p>\n<h3>Understanding the API<\/h3>\n<p>1. In this example, our focus is to create synthetic data to test a single API endpoint from the <strong><a href=\"https:\/\/github.com\/qxf2\/qxf2-survey\" rel=\"noopener\" target=\"_blank\">Qxf2-survey<\/a><\/strong> app. 
We will be testing the endpoint <code>\/survey\/admin\/QElo_filter_response<\/code>, which is designed to return the help responses that occurred within a specific date range.<\/p>\n<p>2. The endpoint queries the <strong>GIVEN<\/strong> and <strong>TAKEN<\/strong> relationships among the employees, filters the <code>helpgiven<\/code> and <code>helptaken<\/code> dates based on the date parameters passed by the user and returns a response similar to this:<\/p>\n<pre lang=\"json\">\r\n[\r\n  {\r\n    \"respondent_id\": 1,\r\n    \"date\": \"2022-02-18\",\r\n    \"question_no\": 1,\r\n    \"answer\": \"dummy_user\"\r\n  },\r\n  {\r\n    \"respondent_id\": 2,\r\n    \"date\": \"2022-02-18\",\r\n    \"question_no\": 2,\r\n    \"answer\": \"dummy_user\"\r\n  }\r\n...\r\n...\r\n]\r\n<\/pre>\n<p>3. In the above response, the &#8220;<code>respondent_id<\/code>&#8221; field identifies the employee who submitted the survey response, while the &#8220;<code>date<\/code>&#8221; field indicates the date on which the employee either gave or received help from another employee. The &#8220;<code>question_no<\/code>&#8221; field provides additional detail on the type of help exchanged, with a value of 1 indicating &#8220;<strong>TAKEN<\/strong>&#8221; help and a value of 2 indicating &#8220;<strong>GIVEN<\/strong>&#8221; help. Finally, the &#8220;<code>answer<\/code>&#8221; field identifies the other employee, who gave help to or received help from the respondent.<\/p>\n<h3> Creating the synthetic data for Neo4j<\/h3>\n<p>Now that we have a clear understanding of the data model and the API to test, we can start creating our synthetic data.<br \/>\nWe will break down the process to create the synthetic data into two parts:<br \/>\nA. Creating the synthetic Nodes<br \/>\nB. Creating the synthetic relationships<\/p>\n<h4>A. 
Creating the synthetic Nodes<\/h4>\n<p>As we know, the API endpoint we are testing queries the &#8216;<strong>GIVEN<\/strong>&#8217; and &#8216;<strong>TAKEN<\/strong>&#8217; relationships. In addition, we need to ensure that it does not return help responses from inactive users. To test this, we will create three separate nodes: one for an employee who &#8216;Gives&#8217; help, another for an employee who &#8216;Takes&#8217; help, and a third node for an inactive employee. This will allow us to test the endpoint accurately and effectively to ensure that it works as intended.<\/p>\n<p>1. First, open your Neo4j Browser or desktop application, create a new database &#8216;neo4j&#8217; where you want to create the nodes and start the database.<\/p>\n<p>2. In the Cypher editor, let&#8217;s type the following command to create our first node:<\/p>\n<pre lang=\"neo4j\">\r\nCREATE (:Employees {\r\n  author_name: \"Generous Giver\",\r\n  lastName: \"Giver\",\r\n  firstName: \"Generous\",\r\n  fullName: \"Generous Giver\",\r\n  ID: 1,\r\n  email: \"generousgiver@qxf2.com\",\r\n  status: \"Y\"\r\n})\r\n<\/pre>\n<p>3. Pay attention to the naming: we have named our employee <strong><em>Generous Giver<\/em><\/strong>. The name alone tells a reader that the node represents an employee who gives a lot of help to other employees. Synthetic data should be meaningful and readily understood, so we should strive to use similar naming conventions for all our synthetic data to ensure consistency and clarity.<\/p>\n<p>4. Now, press the &#8220;Run&#8221; button to execute the command. 
You should see a message indicating that one node has been created.<br \/>\n<figure id=\"attachment_17718\" aria-describedby=\"caption-attachment-17718\" style=\"width: 900px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query-1024x275.png\" alt=\"Neo4j node creation\" width=\"900\" height=\"242\" class=\"size-large wp-image-17718\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query-1024x275.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query-300x81.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query-768x207.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query-1536x413.png 1536w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_query.png 1811w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-17718\" class=\"wp-caption-text\">Successfully creating a new node in Neo4j<\/figcaption><\/figure><\/p>\n<p>5. Now that we have our first node created, let&#8217;s repeat the process to create two other nodes:<\/p>\n<pre lang=\"neo4j\">\r\nCREATE (:Employees {\r\n  author_name: \"Generous Taker\",\r\n  lastName: \"Taker\",\r\n  firstName: \"Generous\",\r\n  fullName: \"Generous Taker\",\r\n  ID: 2,\r\n  email: \"generoustaker@qxf2.com\",\r\n  status: \"Y\"\r\n})\r\n\r\nCREATE (:Employees {\r\n  author_name: \"Inactive User\",\r\n  lastName: \"User\",\r\n  firstName: \"Inactive\",\r\n  fullName: \"Inactive User\",\r\n  ID: 3,\r\n  email: \"inactive_user@qxf2.com\",\r\n  status: \"N\"\r\n})\r\n\r\n<\/pre>\n<p>6. We have named the two employees as <strong><em>Generous Taker<\/em><\/strong> and <strong><em>Inactive User<\/em><\/strong>. 
Again, as the names imply, <strong><em>Generous Taker<\/em><\/strong> represents an employee who receives a lot of help, while <strong><em>Inactive User<\/em><\/strong> is an employee who is no longer active with the company.<\/p>\n<p>Now that we have our nodes created, let&#8217;s see how we can create the relationships for our synthetic data.<\/p>\n<h4>B. Creating the synthetic relationships<\/h4>\n<p>From the data model shown above, we know that there can be two relationships, &#8216;<strong>GIVEN<\/strong>&#8217; and &#8216;<strong>TAKEN<\/strong>&#8217;, among <strong>Employees<\/strong>. So let&#8217;s add these relationships to our synthetic data.<\/p>\n<p>1. First, let&#8217;s add a &#8216;<strong>GIVEN<\/strong>&#8217; relationship between employees <strong><em>Generous Giver<\/em><\/strong> and <strong><em>Generous Taker<\/em><\/strong>, set the <strong>helpgiven<\/strong> property and assign some dates to it.<\/p>\n<pre lang=\"neo4j\">\r\nMATCH (a:Employees) WHERE a.lastName=\"Giver\" \r\nMATCH (b:Employees) WHERE b.lastName=\"Taker\"\r\nMERGE (a)-[R:GIVEN]->(b)\r\nSET R.helpgiven = [\"1970-01-02\", \"1970-01-09\", \"1970-01-16\"]\r\n<\/pre>\n<p>Notice that we have assigned dates from early January 1970, right after the start of the Unix epoch, to the <strong>helpgiven<\/strong> property. This is done to separate the synthetic data from real data and make it easier for a person to distinguish between them. You can use similar techniques or patterns for your use case as well.<\/p>\n<p>2. 
Similarly, let&#8217;s create a <strong>TAKEN<\/strong> relationship between employees <strong><em>Generous Taker<\/em><\/strong> and <strong><em>Generous Giver<\/em><\/strong>, set the <strong>helptaken<\/strong> relationship property, and assign some dates to it.<\/p>\n<pre lang=\"neo4j\">\r\nMATCH (a:Employees) WHERE a.lastName=\"Taker\" \r\nMATCH (b:Employees) WHERE b.lastName=\"Giver\"\r\nMERGE (a)-[R:TAKEN]->(b)\r\nSET R.helptaken = [\"1970-01-02\", \"1970-01-09\", \"1970-01-16\"]\r\n<\/pre>\n<p>3. Since we also need to verify the API response for inactive users, let&#8217;s create a &#8216;<strong>GIVEN<\/strong>&#8217; relationship between <strong><em>Generous Giver<\/em><\/strong> and <strong><em>Inactive User<\/em><\/strong>. As this is a GIVEN relationship, we set its <strong>helpgiven<\/strong> property.<\/p>\n<pre lang=\"neo4j\">\r\nMATCH (a:Employees) WHERE a.fullName=\"Generous Giver\" \r\nMATCH (b:Employees) WHERE b.fullName=\"Inactive User\"\r\nMERGE (a)-[R:GIVEN]->(b)\r\nSET R.helpgiven = [\"1975-01-10\"]\r\n<\/pre>\n<p>4. Let&#8217;s also create a &#8216;GIVEN&#8217; relationship between <strong><em>Inactive User<\/em><\/strong> and <strong><em>Generous Taker<\/em><\/strong>.<\/p>\n<pre lang=\"neo4j\">\r\nMATCH (a:Employees) WHERE a.fullName=\"Inactive User\" \r\nMATCH (b:Employees) WHERE b.fullName=\"Generous Taker\"\r\nMERGE (a)-[R:GIVEN]->(b)\r\nSET R.helpgiven = [\"1975-01-17\"]\r\n<\/pre>\n<p>5. 
Now, run the following query:<\/p>\n<pre lang=\"neo4j\">\r\nMATCH (n:Employees) \r\nRETURN n\r\n<\/pre>\n<p>You should see a graph similar to this:<br \/>\n<figure id=\"attachment_17720\" aria-describedby=\"caption-attachment-17720\" style=\"width: 900px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship-1024x481.png\" alt=\"Graph representation of Neo4j database\" width=\"900\" height=\"423\" class=\"size-large wp-image-17720\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship-1024x481.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship-300x141.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship-768x360.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_relationship.png 1432w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-17720\" class=\"wp-caption-text\">Graph representation of synthetic data created for Neo4j<\/figcaption><\/figure><\/p>\n<p>Finally, we are done setting up our synthetic data. Next, let&#8217;s take a look at how we can back up the synthetic data we created.<\/p>\n<h3> Creating a backup of the synthetic data in Neo4j<\/h3>\n<p>We will be using the &#8216;neo4j-backup&#8217; Python library to create a script that will back up our database. I prefer this approach over the traditional method of creating a dump file for backup and restore, as it can be challenging to automate the latter.<\/p>\n<h4> Initial setup <\/h4>\n<p>1. Before starting, you need to have <strong>Python version 3.6<\/strong> or above installed on your machine.<\/p>\n<p>2. 
Let&#8217;s start off by installing the necessary Python libraries. Note that the <code>decouple<\/code> module is published on PyPI as <code>python-decouple<\/code>:<\/p>\n<pre lang=\"python\">\r\npip install neo4j neo4j_backup python-decouple\r\n<\/pre>\n<p>3. Next, let&#8217;s create a new Python file and name it &#8220;<em>neo4j_backup_script.py<\/em>&#8221;. <\/p>\n<p>We are all set to start writing the script!<\/p>\n<h4> Writing the script to backup the synthetic data<\/h4>\n<p>1. Let&#8217;s start by importing the libraries that we will be using in our script.<\/p>\n<pre lang=\"python\">\r\nfrom neo4j import GraphDatabase\r\nfrom neo4j_backup import Extractor\r\nfrom decouple import config\r\nimport shutil\r\nimport argparse\r\nimport os\r\n<\/pre>\n<p>2. Now, let&#8217;s define the variables we need to connect to our Neo4j database in an environment file. Create the <code>.env<\/code> file in the same directory as our Python script.<\/p>\n<pre lang=\"python\">\r\nDATABASE_HOST=\"bolt:\/\/localhost:7687\"\r\nDATABASE_USERNAME=\"neo4j\"\r\nDATABASE_PASSWORD=\"dummy-password\"\r\n<\/pre>\n<p><strong>Note:<\/strong> To prevent the credentials from being exposed, it&#8217;s important to add the <code>.env<\/code> file to your <code>.gitignore<\/code>. This will ensure that the file is not included in version control.<\/p>\n<p>3. Next, let&#8217;s fetch the environment variables we created in the previous step in our Python script. We will be using the <code>config<\/code> function from the <code>decouple<\/code> library to achieve this.<\/p>\n<pre lang=\"python\">\r\nHOSTNAME = config(\"DATABASE_HOST\")\r\nUSERNAME = config(\"DATABASE_USERNAME\")\r\nPASSWORD = config(\"DATABASE_PASSWORD\")\r\n<\/pre>\n<p>4. Now that we have a way to connect to the database, our script also needs to know the name of our database and the directory where the backup should be stored. 
So let\u2019s use the <code>argparse<\/code> module to get these values from the user as command line arguments.<\/p>\n<pre lang=\"python\">\r\nparser = argparse.ArgumentParser()\r\n\r\nparser.add_argument(\"--database_name\", type=str, help=\"Provide the name of the database\", nargs='?', default=\"neo4j\", const=0)\r\nparser.add_argument(\"--save_dir\", type=str, help=\"Provide the name of the directory in which the backup would be stored\", nargs='?', default=\"synthetic_data\", const=0)\r\n\r\nargs = parser.parse_args()\r\ndatabase = args.database_name\r\nproject_dir = args.save_dir\r\n<\/pre>\n<p>The <code>--database_name<\/code> argument is used to specify the name of the database that needs to be backed up, and the <code>--save_dir<\/code> argument is used to specify the directory where the backup is to be stored.<\/p>\n<p>5. Next, let&#8217;s define the connection settings for the Neo4j database and create a driver object using the <code>GraphDatabase.driver()<\/code> function. This will help us connect to our database.<\/p>\n<pre lang=\"python\">\r\nencrypted = False\r\ntrust = \"TRUST_ALL_CERTIFICATES\"\r\ndriver = GraphDatabase.driver(HOSTNAME, auth=(USERNAME, PASSWORD), encrypted=encrypted, trust=trust)\r\n<\/pre>\n<p><strong>Note:<\/strong> The code above sets <code>encrypted=False<\/code> for an unencrypted connection to the Neo4j database. Only use this for local databases. For remote servers, use <code>encrypted=True<\/code> for encrypted SSL\/TLS connections to secure transmitted data.<\/p>\n<p>6. Now that we have connected to our database, we can finally extract the data to create the backup. We can do this using the <code>Extractor<\/code> class from the <code>neo4j_backup<\/code> library.<\/p>\n<pre lang=\"python\">\r\ninput_yes = False\r\ncompress = True\r\nextractor = Extractor(project_dir=project_dir, driver=driver, database=database, input_yes=input_yes, compress=compress)\r\nextractor.extract_data()\r\n<\/pre>\n<p>7. 
Finally, let&#8217;s compress the backup directory using the <code>make_archive()<\/code> function from the <code>shutil<\/code> library, and then delete the original directory using the <code>rmtree()<\/code> function.<\/p>\n<pre lang=\"python\">\r\nshutil.make_archive(project_dir, 'zip', project_dir)\r\nshutil.rmtree(project_dir)\r\n<\/pre>\n<p>8. Our complete script should look like this:<\/p>\n<pre lang=\"python\">\r\n\"\"\"\r\nBackup Neo4j database\r\n\"\"\"\r\nfrom neo4j import GraphDatabase\r\nfrom neo4j_backup import Extractor\r\nfrom decouple import config\r\nimport shutil\r\nimport argparse\r\nimport os\r\n\r\n# Grabbing environment variables\r\nHOSTNAME = config(\"DATABASE_HOST\")\r\nUSERNAME = config(\"DATABASE_USERNAME\")\r\nPASSWORD = config(\"DATABASE_PASSWORD\")\r\n\r\nif __name__ == \"__main__\":\r\n\r\n    #Add command line arguments to fetch database name and save directory\r\n    parser = argparse.ArgumentParser()\r\n\r\n    #Command line argument to fetch database name. 
Database name is taken as 'neo4j' if no argument is specified\r\n    parser.add_argument(\"--database_name\", type=str, help=\"Provide the name of the database\",\r\n                        nargs='?', default=\"neo4j\", const=0)\r\n    parser.add_argument(\"--save_dir\", type=str, help=\"Provide the name of the directory in which the backup would be stored\",\r\n                        nargs='?', default=\"synthetic_data\", const=0)\r\n    args = parser.parse_args()\r\n    database = args.database_name\r\n    encrypted = False\r\n    trust = \"TRUST_ALL_CERTIFICATES\"\r\n    driver = GraphDatabase.driver(HOSTNAME, auth=(USERNAME, PASSWORD), encrypted=encrypted, trust=trust)\r\n    project_dir = args.save_dir\r\n    input_yes = False\r\n    compress = True\r\n\r\n    #Extract data from database and store the backup\r\n    extractor = Extractor(project_dir=project_dir, driver=driver, database=database,\r\n                          input_yes=input_yes, compress=compress)\r\n    extractor.extract_data()\r\n    shutil.make_archive(project_dir, 'zip', project_dir)\r\n    shutil.rmtree(project_dir)\r\n\r\n<\/pre>\n<h4> Running the backup script<\/h4>\n<p>1. To run the backup script, first make sure that the Neo4j database you want to back up is running.<\/p>\n<p>2. 
Open a terminal, navigate to the location of the <em>neo4j_backup_script.py<\/em> file and run the following command:<\/p>\n<pre lang=\"python\">\r\npython neo4j_backup_script.py --database_name <database_name> --save_dir <save_directory>\r\n<\/pre>\n<p>Make sure to replace <code>database_name<\/code> with the name of the database you want to back up, and <code>save_directory<\/code> with the directory where you want to save the backup file.<br \/>\n<figure id=\"attachment_17810\" aria-describedby=\"caption-attachment-17810\" style=\"width: 900px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1.png\" data-rel=\"lightbox-image-3\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1-1024x143.png\" alt=\"Execution of Neo4j backup script\" width=\"900\" height=\"126\" class=\"size-large wp-image-17810\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1-1024x143.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1-300x42.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1-768x107.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1-1536x215.png 1536w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/03\/neo4j_backup_1.png 1897w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-17810\" class=\"wp-caption-text\">Executing the script to backup the Neo4j database<\/figcaption><\/figure><\/p>\n<p>3. Once the script finishes running, navigate to the <code>save_dir<\/code> location that you specified in the command to run the script. You should see a new zip file created which contains the backup of your database.<\/p>\n<p>Kudos! We have successfully backed up our synthetic data. 
Next, let&#8217;s have a look at how we can restore the data that we backed up.<\/p>\n<h3> Restoring the synthetic data <\/h3>\n<p>In the previous section, we used the neo4j-backup library to create a backup of our database. We will be using the same library again to restore our synthetic data.<br \/>\nWe will not cover the initial setup required for this script, as it is identical to the steps outlined in the previous section to create the backup script.<\/p>\n<h4> Writing the script to restore the synthetic data in Neo4j<\/h4>\n<p>Let&#8217;s start by creating a new Python file and naming it <em>&#8220;neo4j_restore_script.py&#8221;<\/em>.<br \/>\nWith the file in place, we&#8217;re all set to start writing the script to restore the synthetic data.<br \/>\nThe creation of the script will be divided into the following sections:<br \/>\nA. Extracting the backup file<br \/>\nB. Importing the backup data<br \/>\nC. Putting it all together<\/p>\n<h5> A. Extracting the backup file<\/h5>\n<p>In this section, we&#8217;ll explore the process of extracting the backup file and preparing it for import.<br \/>\n1. First, let&#8217;s import the libraries that we will be using in our script.<\/p>\n<pre lang=\"python\">\r\nimport os\r\nimport sys\r\nfrom neo4j import GraphDatabase\r\nfrom neo4j_backup import Importer\r\nfrom py2neo import Graph\r\nfrom decouple import config\r\nfrom zipfile import ZipFile\r\nimport argparse\r\n<\/pre>\n<p>2. Next, in a similar fashion to the backup script, we must define the necessary variables for connecting to our database. As we&#8217;re using the same variables as in the backup script, there&#8217;s no need to define or add any new variables to our environment file. Therefore, we can simply retrieve these environment variables in our script.<\/p>\n<pre lang=\"python\">\r\n# Grabbing environment variables\r\nHOSTNAME = config(\"DATABASE_HOST\")\r\nUSERNAME = config(\"DATABASE_USERNAME\")\r\nPASSWORD = config(\"DATABASE_PASSWORD\")\r\n<\/pre>\n<p>3. 
Now, let&#8217;s go ahead and define the <code>main<\/code> function. As with the backup script, we&#8217;ll use the <code>argparse<\/code> module to define the command line arguments that the restore script will accept. To be precise, we&#8217;ll specify two arguments: <code>database_name<\/code> and <code>import_file<\/code>.<\/p>\n<pre lang=\"python\">\r\nif __name__ == \"__main__\":\r\n\r\n    #Add command line arguments to fetch import file and database name\r\n    parser = argparse.ArgumentParser()\r\n    #Command line argument to fetch import file. File name is taken as 'synthetic_data.zip' if no argument is specified\r\n    parser.add_argument(\"--import_file\", type=str, help=\"Provide the import file zip\",\r\n                        nargs='?', default=\"synthetic_data.zip\", const=0)\r\n    #Command line argument to fetch database name. Database name is taken as 'neo4j' if no argument is specified\r\n    parser.add_argument(\"--database_name\", type=str, help=\"Provide the name of the database\",\r\n                        nargs='?', default=\"neo4j\", const=0)\r\n    args = parser.parse_args()\r\n    IMPORT_FILE = args.import_file\r\n    database = args.database_name\r\n<\/pre>\n<p>The <code>--database_name<\/code> argument is used to specify the name of the database that will be restored, and the <code>--import_file<\/code> argument is used to locate the backup file which was created in the previous section of this post.<\/p>\n<p>4. Now, given that our backup folder is archived, it&#8217;s necessary to extract or unzip the archived folder before we can import it. 
To accomplish this, let&#8217;s first define a function to extract our archived file.<\/p>\n<pre lang=\"python\">\r\ndef unzip_file(IMPORT_FILE,extract_dir):\r\n    with ZipFile(IMPORT_FILE, 'r') as zip:\r\n        # printing all the contents of the zip file\r\n        zip.printdir()    \r\n        # extracting all the files\r\n        print('Extracting all the files now...')\r\n        zip.extractall(extract_dir)\r\n        print('Done!')\r\n<\/pre>\n<p>Here, the <code>IMPORT_FILE<\/code> parameter is the zip file containing the Neo4j backup, and <code>extract_dir<\/code> is the name of the directory into which the archive will be extracted.<\/p>\n<p>5. Next, let&#8217;s call this function from within our <code>main<\/code> function:<\/p>\n<pre lang=\"python\">\r\n#Get the directory into which the archive will be extracted\r\nextract_dir = os.path.splitext(IMPORT_FILE)[0]    \r\nunzip_file(IMPORT_FILE,extract_dir)\r\n<\/pre>\n<p>We now have our backup folder ready to be imported. Next, let&#8217;s take a look at how we can import this backup data to a new database.<\/p>\n<h5> B. Importing the backup <\/h5>\n<p>1. Prior to restoring our database, it is necessary to clear any existing data to ensure accurate restoration. This can be achieved by executing a Cypher query that clears the database. In order to run the query, we will be using the <code>Graph<\/code> class from <code>py2neo<\/code>.<\/p>\n<p>2. First, let&#8217;s create a new file that will hold our Cypher queries and name it <em>&#8220;cypher.py&#8221;<\/em>. Then, add the following query to the file:<\/p>\n<pre lang=\"python\">\r\n#Delete all the records in the Database(WARNING: Never use this query in production database)\r\nDELETE_ALL_RECORDS = \"MATCH (n)\\\r\n                      DETACH DELETE n\"\r\n<\/pre>\n<p>3. Next, let&#8217;s import this file in our Python script.<\/p>\n<pre lang=\"python\">\r\nimport cypher\r\n<\/pre>\n<p>4. 
Now that we have imported the query, let&#8217;s define a constant <code>GRAPH<\/code> in our <code>main<\/code> function, which will hold the authenticated connection to the Neo4j database that enables us to run the query against the database.<\/p>\n<pre lang=\"python\">\r\nGRAPH = auth()\r\n<\/pre>\n<p>5. Now let&#8217;s define the <code>auth()<\/code> function.<\/p>\n<pre lang=\"python\">\r\ndef auth():\r\n    \"Authenticating with the Database\"\r\n    GRAPH = None\r\n    try:\r\n        GRAPH = Graph(HOSTNAME, auth=(USERNAME, PASSWORD))\r\n        print(\"Database authenticated\")\r\n    except Exception as error:\r\n        raise RuntimeError('Database authentication failed') from error\r\n    return GRAPH\r\n<\/pre>\n<p>6. We can now run the query to clear the database in our <code>main<\/code> function:<\/p>\n<pre lang=\"python\">\r\n#Clear the existing data in database\r\nclear_database = GRAPH.run(cypher.DELETE_ALL_RECORDS)\r\n<\/pre>\n<p>7. Having cleared our database, we can now proceed to restore the synthetic data. For this purpose, we need to define the connection settings for the Neo4j database and create a driver object using the <code>GraphDatabase.driver()<\/code> function, which is the same approach used in the backup script.<\/p>\n<pre lang=\"python\">\r\nencrypted = False\r\ntrust = \"TRUST_ALL_CERTIFICATES\"\r\ndriver = GraphDatabase.driver(HOSTNAME, auth=(USERNAME, PASSWORD), encrypted=encrypted, trust=trust)\r\n<\/pre>\n<p><strong>Note:<\/strong> As mentioned previously, the encryption is set to false only because the database is on the local system. However, if the database exists on a remote server, make sure to set the encryption to true.<\/p>\n<p>8. 
Finally, let&#8217;s import the synthetic data to the database by creating an instance of the <code>Importer<\/code> class from the <code>neo4j_backup<\/code> library and calling the <code>import_data()<\/code> method.<\/p>\n<pre lang=\"python\">\r\n#Import the data to the database\r\nimporter = Importer(project_dir=extract_dir, driver=driver, database=database, input_yes=False)\r\nimporter.import_data()\r\n<\/pre>\n<h5>C. Putting it all together<\/h5>\n<p>That\u2019s it! We are finally done with our coding. Our complete code should look similar to this:<\/p>\n<pre lang=\"python\">\r\n\"\"\"\r\nClear the existing database and restore the Neo4j backup \r\n\"\"\"\r\nimport os\r\nimport sys\r\nfrom neo4j import GraphDatabase\r\nfrom neo4j_backup import Importer\r\nfrom py2neo import Graph\r\nfrom decouple import config\r\nimport cypher\r\nfrom zipfile import ZipFile\r\nimport argparse\r\n\r\n# Grabbing environment variables\r\nHOSTNAME = config(\"DATABASE_HOST\")\r\nUSERNAME = config(\"DATABASE_USERNAME\")\r\nPASSWORD = config(\"DATABASE_PASSWORD\")\r\n\r\ndef auth():\r\n    \"Authenticating with the Database\"\r\n    GRAPH = None\r\n    try:\r\n        GRAPH = Graph(HOSTNAME, auth=(USERNAME, PASSWORD))\r\n        print(\"Database authenticated\")\r\n    except Exception as error:\r\n        raise RuntimeError('Database authentication failed') from error\r\n    return GRAPH\r\n\r\ndef unzip_file(IMPORT_FILE,extract_dir):\r\n    \"Unzip the import files\"\r\n    with ZipFile(IMPORT_FILE, 'r') as zip:\r\n        # printing all the contents of the zip file\r\n        zip.printdir()    \r\n        # extracting all the files\r\n        print('Extracting all the files now...')\r\n        zip.extractall(extract_dir)\r\n        print('Done!')\r\n\r\nif __name__ == \"__main__\":\r\n\r\n    #Add command line arguments to fetch import file and database name\r\n    parser = argparse.ArgumentParser()\r\n    #Command line argument to fetch import file. 
File name is taken as 'synthetic_data.zip' if no argument is specified\r\n    parser.add_argument(\"--import_file\", type=str, help=\"Provide the import file zip\",\r\n                        nargs='?', default=\"synthetic_data.zip\", const=\"synthetic_data.zip\")\r\n    #Command line argument to fetch database name. Database name is taken as 'neo4j' if no argument is specified\r\n    parser.add_argument(\"--database_name\", type=str, help=\"Provide the name of the database\",\r\n                        nargs='?', default=\"neo4j\", const=\"neo4j\")\r\n    args = parser.parse_args()\r\n    IMPORT_FILE = args.import_file\r\n    database = args.database_name\r\n\r\n    #Get the directory into which the archive will be extracted\r\n    extract_dir = os.path.splitext(IMPORT_FILE)[0]    \r\n    unzip_file(IMPORT_FILE,extract_dir)\r\n\r\n    GRAPH = auth()\r\n\r\n    #Clear the existing data in database\r\n    clear_database = GRAPH.run(cypher.DELETE_ALL_RECORDS)\r\n    encrypted = False\r\n    trust = \"TRUST_ALL_CERTIFICATES\"\r\n    driver = GraphDatabase.driver(HOSTNAME, auth=(USERNAME, PASSWORD), encrypted=encrypted, trust=trust)\r\n\r\n    #Import the data to the database\r\n    importer = Importer(project_dir=extract_dir, driver=driver, database=database, input_yes=False)\r\n    importer.import_data()\r\n<\/pre>\n<h4>Restoring and verifying the data<\/h4>\n<p>1. To run the restore script, first make sure that the Neo4j database you want to restore is up and running.<br \/>\n2. 
Open a terminal, navigate to the location of the <em>neo4j_restore_script.py<\/em> file and run the following command:<\/p>\n<pre lang=\"python\">\r\npython neo4j_restore_script.py --database_name <database_name> --import_file <backup_file>\r\n<\/pre>\n<p>Make sure to replace <code>database_name<\/code> with the name of the database you wish to restore the data into, and <code>backup_file<\/code> with the archived backup file that was created by the backup script.<br \/>\n<figure id=\"attachment_17724\" aria-describedby=\"caption-attachment-17724\" style=\"width: 900px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore.png\" data-rel=\"lightbox-image-4\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore-1024x237.png\" alt=\"Execution of Neo4j database restore script.\" width=\"900\" height=\"208\" class=\"size-large wp-image-17724\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore-1024x237.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore-300x69.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore-768x177.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore-1536x355.png 1536w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/02\/neo4j_restore.png 1891w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-17724\" class=\"wp-caption-text\">Restoring synthetic data to Neo4j database<\/figcaption><\/figure><\/p>\n<p>3. Once the script finishes running, check your database. If everything went smoothly, your database should be populated with the synthetic data that we created.<\/p>\n<p>Hurray! We have successfully restored our synthetic data! 
<\/p>\n<h3>Further reading<\/h3>\n<p>If you&#8217;re interested in how we set up tests for the synthetic data that we created, you can check out the test &#8220;<em><a href=\"https:\/\/github.com\/qxf2\/qxf2-survey\/blob\/master\/backend\/tests\/parallel_tests\/test_responses_between_dates.py\" rel=\"noopener\" target=\"_blank\">test_responses_between_dates.py<\/a><\/em>&#8221; in our <a href=\"https:\/\/github.com\/qxf2\/qxf2-survey\" rel=\"noopener\" target=\"_blank\">Qxf2-survey<\/a> GitHub repository.<br \/>\nTo run the test, you will need to have the <a href=\"https:\/\/github.com\/qxf2\/qxf2-survey\" rel=\"noopener\" target=\"_blank\">Qxf2-survey<\/a> app set up on your machine. You can follow the instructions in the Qxf2-survey app&#8217;s GitHub documentation to set it up on your local system.<\/p>\n<hr>\n<h3>Hire technical testers from Qxf2<\/h3>\n<p>Qxf2 is the home of the technical tester. As you can see from this post, our testers go beyond the regular &#8220;manual&#8221; or &#8220;automation&#8221; paradigm of testing. We are engineers who test software. We lay the groundwork for testing, improve testability and enable your entire team to participate in testing. If you are looking for testers with solid engineering backgrounds, reach out to us <a href=\"https:\/\/qxf2.com\/contact?utm_source=syntheticNeo4j&#038;utm_medium=click&#038;utm_campaign=From%20blog\">here<\/a>.<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>Qxf2 engineers are fans of using synthetic data for testing. We have used this technique for years now. But until recently, our experience was limited to SQL. We needed to work with neo4j in one of our projects. 
There was enough custom Python code and Cypher queries that we had to write as part of creating, backing up and restoring [&hellip;]<\/p>\n","protected":false},"author":29,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[330,331,18,329],"tags":[],"class_list":["post-17608","post","type-post","status-publish","format-standard","hentry","category-neo4j","category-neo4j-backup","category-python","category-synthetic-data"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/17608","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=17608"}],"version-history":[{"count":91,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/17608\/revisions"}],"predecessor-version":[{"id":17812,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/17608\/revisions\/17812"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=17608"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=17608"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=17608"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}