{"id":11205,"date":"2019-08-12T07:54:49","date_gmt":"2019-08-12T11:54:49","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=11205"},"modified":"2019-08-12T07:54:49","modified_gmt":"2019-08-12T11:54:49","slug":"anonymize-data-using-faker","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/anonymize-data-using-faker\/","title":{"rendered":"Anonymize data using Python Faker"},"content":{"rendered":"<p>We have an application that holds sensitive survey information. Since we couldn&#8217;t share the data with everyone, we wanted a good way to anonymize it so that we could hand over development to anyone. We found Faker to be a good library for generating fake data.<\/p>\n<hr \/>\n<h3>Why this post<\/h3>\n<p>As testers, we may need to work with real data, but if the data contains sensitive information, it may not be possible to use it directly. In such cases, anonymizing the data becomes important. In this post, you will learn how to anonymize data using Faker.<\/p>\n<hr \/>\n<h3>What is Faker?<\/h3>\n<p>Faker is a Python library that generates fake data for you. 
You can use it to anonymize your production data, create dummy data to populate a test database, and so on.<\/p>\n<hr \/>\n<h3>Installation<\/h3>\n<p>To install Faker, simply run:<\/p>\n<pre lang=\"python\">pip install Faker<\/pre>\n<p>Next, since our example deals with CSV data, we need a Python library called unicodecsv to iterate over the CSV file. Install it by running:<\/p>\n<pre lang=\"python\">pip install unicodecsv<\/pre>\n<hr \/>\n<h3>Working example to anonymize data from a CSV file<\/h3>\n<h5>Contents of the original_data.csv file<\/h5>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/original_data_csv.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-11306\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/original_data_csv-300x144.png\" alt=\"\" width=\"300\" height=\"144\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/original_data_csv-300x144.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/original_data_csv.png 345w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Suppose we have a CSV file named original_data.csv with three columns (Id, Name and Email) and five records, as shown in the screenshot above. We will anonymize the values in the Name and Email columns using the Faker library.<\/p>\n<p>Create a file named anonymize_data.py with the following content:<\/p>\n<pre lang=\"python\">\"\"\"\r\nThis script anonymizes the data in original_data.csv and writes the result to anonymized_data.csv\r\n\"\"\"\r\n\r\nimport unicodecsv as csv\r\nfrom faker import Faker\r\nfrom collections import defaultdict\r\n\r\n\r\ndef anonymize():\r\n\t'Anonymizes the given original data to anonymized form'\r\n\t# Load the faker and its providers\r\n\tfaker  = Faker()\r\n\r\n\t# Create 
mappings of names &amp; emails to faked names &amp; emails.\r\n\t# A defaultdict calls its factory once per new key, so a repeated value always maps to the same fake.\r\n\tnames  = defaultdict(faker.name)\r\n\temails = defaultdict(faker.email)\r\n    \r\n\t# unicodecsv expects files opened in binary mode\r\n\twith open(\"original_data.csv\", 'rb') as f:\r\n\t    with open(\"anonymized_data.csv\", 'wb') as o:\r\n\t        # Use the DictReader to easily extract fields\r\n\t        reader = csv.DictReader(f)\r\n\t        writer = csv.DictWriter(o, reader.fieldnames)\r\n\t        writer.writeheader()\r\n\t        for row in reader:\r\n\t            row['name']  = names[row['name']]\r\n\t            row['email'] = emails[row['email']]\r\n\t            writer.writerow(row)\r\n\r\nif __name__ == '__main__':\r\n    anonymize()\r\n<\/pre>\n<p>As the code shows, you can fake a name by calling faker.name(). Each call to faker.name() returns a different (random) result, and the same is true of faker.email(). Because the script looks values up through a defaultdict, each original name or email is faked once and then reused, so repeated values stay consistent in the output. The Faker object has around 158 different methods, each of which generates a particular kind of fake data. <span>Faker delegates the data generation to providers. The default provider uses the English locale. Faker supports other locales; they differ in the level of completion. 
If you wish to use a different locale, see the <a href=\"https:\/\/faker.readthedocs.io\/en\/master\/locales.html#\">Faker Locales<\/a> page.<\/span><\/p>\n<p>You can run the script with:<\/p>\n<pre lang=\"python\">python anonymize_data.py<\/pre>\n<p>This generates an anonymized_data.csv file containing your anonymized data. The file is written to the directory you run the script from, which is the script&#8217;s own directory if you run it there.<\/p>\n<h5>Contents of the newly generated anonymized_data.csv<\/h5>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/anonymized_data_csv.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-11307\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/anonymized_data_csv-300x103.png\" alt=\"\" width=\"300\" height=\"103\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/anonymized_data_csv-300x103.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2019\/08\/anonymized_data_csv.png 402w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>You can see that anonymized_data.csv has the same number of rows and the same field names as the original file. The only difference is that the names and emails have been replaced with fake ones.<\/p>\n<hr \/>\n<p>We hope this post showed you how a few lines of code can anonymize a dataset. The next time your team says they can&#8217;t use real data because it contains sensitive information, share what you know about this library and get the data rolling. Happy testing&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We have an application that holds sensitive survey information. Since we couldn&#8217;t share the data with everyone, we wanted a good way to anonymize it so that we could hand over development to anyone. We found Faker to be a good library for generating fake data. 
Why this post As a tester, we may need to work with [&hellip;]<\/p>\n","protected":false},"author":17,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[198,18],"tags":[],"class_list":["post-11205","post","type-post","status-publish","format-standard","hentry","category-faker","category-python"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11205","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/17"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=11205"}],"version-history":[{"count":23,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11205\/revisions"}],"predecessor-version":[{"id":11336,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/11205\/revisions\/11336"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=11205"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=11205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=11205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}