{"id":5635,"date":"2017-05-30T05:57:44","date_gmt":"2017-05-30T09:57:44","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=5635"},"modified":"2018-04-03T10:07:55","modified_gmt":"2018-04-03T14:07:55","slug":"cleaning-data-python-pandas","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/cleaning-data-python-pandas\/","title":{"rendered":"Cleaning data with Python"},"content":{"rendered":"<p>I am sharing some tips and tricks on cleaning data and restructuring the data you are using for testing.<\/p>\n<h3> Why this post?<\/h3>\n<p>Qxf2 works with many data intensive applications. I&#8217;ve noticed a pattern &#8211; my colleagues hit very similar data related problems across many different projects. That got me thinking critically about test data. I was thrilled to stumble upon an excellent paper on <a href=\"http:\/\/vita.had.co.nz\/papers\/tidy-data.html\">Tidy Data<\/a> that has helped me understand and articulate the problems with poorly structured data better. I&#8217;m sure a lot of people in the world of data science already know about this paper. But this blog is geared towards testers and I don&#8217;t know of too many testers who have been exposed to the idea of clean data. So I thought it would be worth writing a post to expose testers to the concept of tidy data. In addition to tidying data, I thought I would give you some examples of cleaning operations we have used on various projects. <\/p>\n<hr>\n<h3>Should QA think about designing data?<\/h3>\n<p>Yes. We think so. It has worked very well for us. Qxf2 has been testing many data intensive applications. My colleagues have tested survey applications, data querying and visualization tools, healthcare algorithms, patient data, etc. We work with startups and early stage software. So, usually, we have the opportunity to study the data carefully and design our own data sets for testing purposes. 
This practice of designing rich data sets to enable testing, as opposed to using whatever data exists on production, has helped our engineering colleagues move faster and made our applications and algorithms more robust to mistakes in data. A couple of technical auditors (technical audits happen when our clients seek investments\/acquisitions) have praised QA&#8217;s approach of using designed test data. Designed test data also helps when testing the UI and makes for a good demo. So, usually, the time you spend designing good data is worth it. <\/p>\n<hr>\n<h3>How does a tester benefit from this post?<\/h3>\n<p>This post will help you think more deeply about the data being used for testing. I think you can use the tips and tricks in this post in a few ways:<br \/>\na) If you are completely new to working with data, this post is a good starting point.<br \/>\nb) When you are testing parsers, you can purposely create &#8216;dirty&#8217; data to ensure the parser remains robust and that the error messages are human-friendly.<br \/>\nc) When you are testing algorithms, you can design test data so that you test not only the key pieces of the algorithm but also the algorithm&#8217;s validity against missing pieces of data, incorrect data, updated data, etc.<br \/>\nd) If you are scraping data and using it for your testing, you will learn to structure your data better.<\/p>\n<hr>\n<h3>Setup for data cleaning<\/h3>\n<p>I&#8217;ve tried my best to keep this blog post self-contained. I&#8217;ve created a simple data set with some common mistakes in it. By the end of this post, we will have cleaned this data set. You can download patient_heart_rate.csv over <a href=\"https:\/\/github.com\/qxf2\/python-data-cleaning-blogpost\">here<\/a>. I am using <a href=\"http:\/\/pandas.pydata.org\">Python Pandas<\/a> to clean the data. But you are welcome to use plain old Python and write your own code instead of using my snippets. 
The goal is to learn about the common difficulties with cleaning data rather than to learn the Python code used to clean the data.<\/p>\n<hr>\n<h3>Get to know the data<\/h3>\n<p>In any process involving data, our first goal should always be to understand what the data is and what we want to achieve with it. This means looking at the data and trying to understand its columns and rows, the number of records, the format of the data, and any semantic errors, missing entries, or inconsistent formatting that require the data to be &#8220;cleaned&#8221; prior to analysis.<\/p>\n<p>In this post, we will walk through some data tidying examples using Python libraries. For this purpose, we have created a patient_heart_rate.csv dataset as an example. The data set has been kept small enough for you to be able to grok it all at once. The data is in csv format. Each row in the dataset holds data about a different individual and their heart rate readings for different time intervals. The columns contain information such as each individual&#8217;s Age, Weight, Sex and heart rates taken at different time intervals.<\/p>\n<p>Here is how our dataset looks:<br \/>\n<a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Capture.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Capture.png\" alt=\"\" width=\"600\" height=\"232\" class=\"aligncenter size-full wp-image-5780\" \/><\/a><\/p>\n<hr\/>\n<h3>Problems in our dataset<\/h3>\n<p>For illustration, we have created the above dataset so that each row exhibits one of the problems listed below:<\/p>\n<p>1. Missing headers in the csv file<br \/>\n2. Multiple variables are stored in one column<br \/>\n3. Column data contains inconsistent unit values<br \/>\n4. Missing values<br \/>\n5. An empty row in the data<br \/>\n6. Duplicate records in the data<br \/>\n7. Non-ASCII characters<br \/>\n8. 
Column headers are values and not variable names<\/p>\n<hr \/>\n<h3>Cleaning Data<\/h3>\n<p>The following sections illustrate each problem in detail and show how to tidy them using the Python Pandas library. <\/p>\n<p><strong>Problem 1 &#8211; Missing headers in the csv file<\/strong><br \/>\nIf you look at the data file, you will notice that there is no header (see Fig 1). By default, Pandas treats the first line of a csv file as the header. Since our file has no header, we can still read it correctly by providing the column names ourselves. Have a look at the below code.<\/p>\n<pre lang=\"Python\">\r\nimport pandas as pd\r\n\r\ncolumn_names = [\"Id\", \"Name\", \"Age\", \"Weight\", \"m0006\", \"m0612\", \"m1218\", \"f0006\", \"f0612\", \"f1218\"]\r\ndf = pd.read_csv(\"patient_heart_rate.csv\", names=column_names)\r\ndf.head()\r\n<\/pre>\n<p>As shown above, we can declare our own headers. We read the data in the same way as before, except this time we pass the names argument (names=column_names) to tell Pandas to use our headers, as the original file has none. 
See Fig 2.<\/p>\n<figure id=\"attachment_5774\" aria-describedby=\"caption-attachment-5774\" style=\"width: 608px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/missingheader.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5774\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/missingheader.png\" alt=\"\" width=\"608\" height=\"196\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/missingheader.png 608w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/missingheader-300x97.png 300w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><\/a><figcaption id=\"caption-attachment-5774\" class=\"wp-caption-text\">Fig 2 &#8211; Added headers to the csv file<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 2 &#8211; Multiple variables are stored in one column<\/strong><br \/>\nIn the patient_heart_rate dataset, the Name column contains two variables, Firstname and Lastname (see Fig 3). 
To respect tidy data principles, we want to isolate each variable into its own column.<\/p>\n<pre lang=\"python\">import pandas as pd\r\ndf = pd.read_csv(\"patient_heart_rate.csv\")\r\nprint(df)<\/pre>\n<figure id=\"attachment_5770\" aria-describedby=\"caption-attachment-5770\" style=\"width: 615px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_before_separation1.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5770\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_before_separation1.png\" alt=\"\" width=\"615\" height=\"194\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_before_separation1.png 615w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_before_separation1-300x95.png 300w\" sizes=\"auto, (max-width: 615px) 100vw, 615px\" \/><\/a><figcaption id=\"caption-attachment-5770\" class=\"wp-caption-text\">Fig 3 \u2013 Name column having First and Last Name<\/figcaption><\/figure>\n<p>Separating these two variables can be accomplished using the split function.<br \/>\nWe use the vectorised str.split with expand=True, which unpacks the result into the new columns as shown below.<\/p>\n<pre lang=\"python\">df[['Firstname','Lastname']] = df['Name'].str.split(expand=True)\r\ndf = df.drop('Name', axis=1)\r\nprint(df)<\/pre>\n<p>Once you execute the above code, the data looks like this (Fig 4)<\/p>\n<figure id=\"attachment_5760\" aria-describedby=\"caption-attachment-5760\" style=\"width: 627px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_after_Separation1-2.png\" data-rel=\"lightbox-image-3\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5760\" 
src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_after_Separation1-2.png\" alt=\"\" width=\"627\" height=\"196\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_after_Separation1-2.png 627w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/Name_after_Separation1-2-300x94.png 300w\" sizes=\"auto, (max-width: 627px) 100vw, 627px\" \/><\/a><figcaption id=\"caption-attachment-5760\" class=\"wp-caption-text\">Fig 4 \u2013 First Name and Last Name as separate columns<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 3 \u2013 Column data contains inconsistent unit values<\/strong><br \/>\nIf you observe the dataset, the measurement unit used in the Weight column is not consistent. The second and seventh rows contain inconsistent unit values (kgs, lbs). See Fig 5.<\/p>\n<figure id=\"attachment_5761\" aria-describedby=\"caption-attachment-5761\" style=\"width: 636px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_before_conversion1.png\" data-rel=\"lightbox-image-4\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5761\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_before_conversion1.png\" alt=\"\" width=\"636\" height=\"196\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_before_conversion1.png 636w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_before_conversion1-300x92.png 300w\" sizes=\"auto, (max-width: 636px) 100vw, 636px\" \/><\/a><figcaption id=\"caption-attachment-5761\" class=\"wp-caption-text\">Fig 5 \u2013 Weight column not consistent (shown in kgs and lbs)<\/figcaption><\/figure>\n<p>The Weight column is sometimes recorded in kgs, sometimes in lbs. To resolve this, we have written Python code to convert the measuring unit and represent the Weight column in kgs.<\/p>\n<pre lang=\"python\">import 
pandas as pd\r\n#Read the data set\r\ndf = pd.read_csv(\"patient_heart_rate.csv\")\r\n\r\nfor i in range(0, len(df)):\r\n    x = str(df.loc[i, 'Weight'])\r\n    #In case lbs is part of the observation, remove it\r\n    if \"lbs\" in x[-3:]:\r\n        #Remove the lbs from the value\r\n        x = x[:-3]\r\n        #Convert string to float\r\n        float_x = float(x)\r\n        #Convert to kgs and store as int\r\n        y = int(float_x\/2.2)\r\n        #Write the converted value back; df.loc avoids the\r\n        #chained-assignment pitfall of modifying a copy\r\n        df.loc[i, 'Weight'] = str(y) + \"kgs\"\r\nprint(df)<\/pre>\n<p>This results in a tidy dataset. See Fig 6.<\/p>\n<figure id=\"attachment_5762\" aria-describedby=\"caption-attachment-5762\" style=\"width: 597px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_after_conversion1.png\" data-rel=\"lightbox-image-5\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5762\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_after_conversion1.png\" alt=\"\" width=\"597\" height=\"198\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_after_conversion1.png 597w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/weight_after_conversion1-300x99.png 300w\" sizes=\"auto, (max-width: 597px) 100vw, 597px\" \/><\/a><figcaption id=\"caption-attachment-5762\" class=\"wp-caption-text\">Fig 6 \u2013 Weight column represented in kgs<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 4 &#8211; Missing values<\/strong><br \/>\nThere are a few missing values in the Age, Weight and Heart Rate columns of the dataset. It is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data. 
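In Pandas, the common strategies are one-liners. Here is a minimal sketch using a small inline frame as a stand-in for our dataset (the column names mirror the earlier sections; the values are illustrative, not the actual csv contents):

```python
import pandas as pd

# Small stand-in frame with a missing Age and a missing Weight
df = pd.DataFrame({
    "Firstname": ["Mickey", "Donald", "Minnie"],
    "Age": [56, None, 16],
    "Weight": [70, 54, None],
})

# Deletion: drop any record that has a missing value
dropped = df.dropna()

# Dummy substitution: replace missing values with a valid dummy, e.g. 0
dummy = df.fillna(0)

# Mean substitution: replace missing values with the column mean
mean_filled = df.fillna({"Age": df["Age"].mean(), "Weight": df["Weight"].mean()})

print(len(dropped), dummy["Age"].tolist(), mean_filled["Age"].tolist())
```

Frequent substitution follows the same pattern using `df["Age"].mode()[0]` as the fill value.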
But we encourage testers to <strong>talk to your subject matter experts<\/strong> about your missing data. Typical ways your subject matter experts will suggest handling missing values are:<br \/>\n<strong>a. Deletion:<\/strong> Remove records with missing values.<br \/>\n<strong>b. Dummy substitution:<\/strong> Replace missing values with a dummy but valid value, e.g., 0 for numerical values.<br \/>\n<strong>c. Mean substitution:<\/strong> Replace the missing values with the mean.<br \/>\n<strong>d. Frequent substitution:<\/strong> Replace the missing values with the most frequent item.<br \/>\n<strong>e. Improve the data collector:<\/strong> Your business folks can talk to the clients and explain why it is worth fixing the problem at the data collector itself.<\/p>\n<hr \/>\n<p><strong>Problem 5 &#8211; An empty row in the data<\/strong><br \/>\nIf you observe the dataset, you can see that there is an empty row. If you read this file with the Pandas library and look at the content of your dataframe, you have an empty row that has been filled with NaNs. See Fig 7.<\/p>\n<pre lang=\"python\">import pandas as pd\r\ndf = pd.read_csv(\"patient_heart_rate.csv\")\r\nprint(df)\r\n<\/pre>\n<figure id=\"attachment_5763\" aria-describedby=\"caption-attachment-5763\" style=\"width: 595px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_before1.png\" data-rel=\"lightbox-image-6\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5763\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_before1.png\" alt=\"\" width=\"595\" height=\"196\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_before1.png 595w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_before1-300x99.png 300w\" sizes=\"auto, (max-width: 595px) 100vw, 595px\" \/><\/a><figcaption id=\"caption-attachment-5763\" 
class=\"wp-caption-text\">Fig 7 \u2013 Record with Empty Lines<\/figcaption><\/figure>\n<p>Rather than trying to skip the row while reading the file, we handle it after the read using the dropna() method, as shown in the code below.<\/p>\n<pre lang=\"python\">df.dropna(how=\"all\", inplace=True)<\/pre>\n<p>After executing this code, the empty row is removed. See Fig 8.<\/p>\n<figure id=\"attachment_5764\" aria-describedby=\"caption-attachment-5764\" style=\"width: 591px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_after1.png\" data-rel=\"lightbox-image-7\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5764\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_after1.png\" alt=\"\" width=\"591\" height=\"180\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_after1.png 591w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/empty_rows_after1-300x91.png 300w\" sizes=\"auto, (max-width: 591px) 100vw, 591px\" \/><\/a><figcaption id=\"caption-attachment-5764\" class=\"wp-caption-text\">Fig 8 \u2013 Empty Lines Removed<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 6 \u2013 Duplicate records in the data<\/strong><br \/>\nSometimes a dataset contains duplicate records. In our dataset, one record is repeated in the 6th and 9th rows. 
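Before removing anything, Pandas can flag the repeats for inspection with duplicated(). A small sketch on an inline stand-in frame (the names are illustrative, not the actual csv contents):

```python
import pandas as pd

# Stand-in frame with one repeated record
df = pd.DataFrame({
    "Firstname": ["Mickey", "Donald", "Donald"],
    "Lastname": ["Mouse", "Duck", "Duck"],
})

# duplicated() marks every repeat of an earlier row as True
mask = df.duplicated(subset=["Firstname", "Lastname"])
print(mask.tolist())
print(df[mask])  # show only the offending records
```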
See Fig 9 for the duplicate records.<\/p>\n<pre lang=\"python\">import pandas as pd\r\n#Read the data set\r\ndf = pd.read_csv(\"patient_heart_rate.csv\")\r\nprint(df)<\/pre>\n<figure id=\"attachment_5767\" aria-describedby=\"caption-attachment-5767\" style=\"width: 599px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_before1-2.png\" data-rel=\"lightbox-image-8\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5767\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_before1-2.png\" alt=\"\" width=\"599\" height=\"184\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_before1-2.png 599w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_before1-2-300x92.png 300w\" sizes=\"auto, (max-width: 599px) 100vw, 599px\" \/><\/a><figcaption id=\"caption-attachment-5767\" class=\"wp-caption-text\">Fig 9 \u2013 Duplicate Records<\/figcaption><\/figure>\n<p>First, check whether you have duplicate records. If you do, you can use the Pandas function drop_duplicates() to remove them.<\/p>\n<pre lang=\"python\">\r\n#Use the drop_duplicates method to remove duplicate records. Passing subset restricts the columns used to identify duplicates.\r\ndf = df.drop_duplicates(subset=['Firstname','Lastname'])\r\nprint(df)\r\n<\/pre>\n<p>Once this code is executed, the duplicate records get deleted. 
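One caveat worth noting (an addition to the post, not from it): drop_duplicates() keeps the original row labels, so the index ends up with gaps where rows were removed. A short sketch of verifying the cleanup and re-indexing, on an inline stand-in frame:

```python
import pandas as pd

# Stand-in frame mirroring the drop_duplicates() step above
df = pd.DataFrame({
    "Firstname": ["Mickey", "Donald", "Donald", "Minnie"],
    "Lastname": ["Mouse", "Duck", "Duck", "Mouse"],
})
df = df.drop_duplicates(subset=["Firstname", "Lastname"])

# Confirm no duplicates survive, then close the index gap left by the drop
assert not df.duplicated(subset=["Firstname", "Lastname"]).any()
df = df.reset_index(drop=True)
print(df.index.tolist())  # contiguous again
```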
See Fig 10.<\/p>\n<figure id=\"attachment_5768\" aria-describedby=\"caption-attachment-5768\" style=\"width: 592px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png\" data-rel=\"lightbox-image-9\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5768\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png\" alt=\"\" width=\"592\" height=\"167\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png 592w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1-300x85.png 300w\" sizes=\"auto, (max-width: 592px) 100vw, 592px\" \/><\/a><figcaption id=\"caption-attachment-5768\" class=\"wp-caption-text\">Fig 10 \u2013 Duplicate Records Removed<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 7 &#8211; Non-ASCII characters<\/strong><br \/>\nThere are a few non-ASCII characters in the Firstname and Lastname columns of the data. 
See Fig 11 below.<\/p>\n<figure id=\"attachment_5768\" aria-describedby=\"caption-attachment-5768\" style=\"width: 592px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png\" data-rel=\"lightbox-image-9\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5768\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png\" alt=\"\" width=\"592\" height=\"167\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1.png 592w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/duplicate_records_after1-300x85.png 300w\" sizes=\"auto, (max-width: 592px) 100vw, 592px\" \/><\/a><figcaption id=\"caption-attachment-5768\" class=\"wp-caption-text\">Fig 11 &#8211; Non-ASCII characters in the data<\/figcaption><\/figure>\n<p>There are multiple ways to handle non-ASCII characters. You can remove them, replace them with other characters, just post a warning, etc. In this example, we are removing them.<\/p>\n<pre lang=\"python\">\r\nimport pandas as pd\r\ndf = pd.read_csv(\"patient_heart_rate.csv\")\r\ndf.Firstname.replace({r'[^\\x00-\\x7F]+':''}, regex=True, inplace=True)\r\ndf.Lastname.replace({r'[^\\x00-\\x7F]+':''}, regex=True, inplace=True)\r\nprint(df)\r\n<\/pre>\n<p>After executing this code, the non-ASCII characters are replaced with blanks. 
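Deleting the characters outright loses information (for example, 'Zoë' becomes 'Zo'). An alternative, not used in this post, is transliterating to the closest ASCII form with Python's standard unicodedata module; a sketch on stand-in names:

```python
import unicodedata

import pandas as pd

def to_ascii(text):
    """Transliterate accented characters to their closest ASCII equivalents."""
    normalized = unicodedata.normalize("NFKD", str(text))
    return normalized.encode("ascii", "ignore").decode("ascii")

# Stand-in names containing non-ASCII characters (illustrative only)
df = pd.DataFrame({"Firstname": ["Zoë", "José"], "Lastname": ["Müller", "Niño"]})
df["Firstname"] = df["Firstname"].apply(to_ascii)
df["Lastname"] = df["Lastname"].apply(to_ascii)
print(df["Firstname"].tolist(), df["Lastname"].tolist())
```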
See Fig 12 below.<\/p>\n<figure id=\"attachment_5796\" aria-describedby=\"caption-attachment-5796\" style=\"width: 589px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png\" data-rel=\"lightbox-image-11\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png\" alt=\"\" width=\"589\" height=\"163\" class=\"size-full wp-image-5796\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png 589w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after-300x83.png 300w\" sizes=\"auto, (max-width: 589px) 100vw, 589px\" \/><\/a><figcaption id=\"caption-attachment-5796\" class=\"wp-caption-text\">Fig 12 \u2013 After removing non-ASCII characters<\/figcaption><\/figure>\n<hr \/>\n<p><strong>Problem 8 &#8211; Column headers are values and not variable names<\/strong><br \/>\nThe column headers are composed of values of the Sex and Hour range variables for each individual row. 
Corresponding to each &#8216;m&#8217; column for males, there is also an &#8216;f&#8217; column for females: f0006, f0612 and so on, as shown in Fig 13.<br \/>\n<figure id=\"attachment_5796\" aria-describedby=\"caption-attachment-5796\" style=\"width: 589px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png\" data-rel=\"lightbox-image-11\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png\" alt=\"\" width=\"589\" height=\"163\" class=\"size-full wp-image-5796\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after.png 589w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/non_ascii_after-300x83.png 300w\" sizes=\"auto, (max-width: 589px) 100vw, 589px\" \/><\/a><figcaption id=\"caption-attachment-5796\" class=\"wp-caption-text\">Fig 13 \u2013 Column headers are values<\/figcaption><\/figure><\/p>\n<p>A tidy version of this dataset is one in which these values are removed from the header and unpivoted into rows. Here, Sex and Hour range would not be column headers but rather row values in Sex and Time columns. In order to tidy this dataset, we need to &#8220;melt&#8221; the data. We\u2019ll first &#8220;melt&#8221; the Sex + Hour Range columns into a single one. The pandas library has a built-in function, melt(), that does just that: it \u201cunpivots\u201d a DataFrame from a wide format to a long format. Once we have that single column, we\u2019ll derive three columns from it: Sex, hours_lower and hours_upper. 
With those, we\u2019ll be able to build a tidy dataset.<\/p>\n<pre lang=\"python\">#Melt the Sex + time range columns in single column\r\ndf = pd.melt(df,id_vars=['Id','Age','Weight','Firstname','Lastname'], value_name=\"PulseRate\",var_name=\"sex_and_time\").sort_values(['Id','Age','Weight','Firstname','Lastname'])\r\n\r\n# Extract Sex, Hour lower bound and Hour upper bound group\r\ntmp_df = df[\"sex_and_time\"].str.extract(\"(\\D)(\\d+)(\\d{2})\",expand=True)\r\n\r\n# Name columns\r\ntmp_df.columns = [\"Sex\", \"hours_lower\", \"hours_upper\"]\r\n\r\n# Create Time column based on \"hours_lower\" and \"hours_upper\" columns\r\ntmp_df[\"Time\"] = tmp_df[\"hours_lower\"] + \"-\" + tmp_df[\"hours_upper\"]\r\n\r\n# Merge \r\ndf = pd.concat([df, tmp_df], axis=1)\r\n\r\n# Drop unnecessary columns and rows\r\ndf = df.drop(['sex_and_time','hours_lower','hours_upper'], axis=1)\r\ndf = df.dropna()\r\ndf.to_csv('outputcleanup.csv',index=False)\r\nprint (df.head())\r\n<\/pre>\n<p>Once you execute the above code, the data looks like this<\/p>\n<p>Fig.14 shows the results of melting the patient_heart_rate dataset with row values in Sex and Time columns.<br \/>\n<figure id=\"attachment_5797\" aria-describedby=\"caption-attachment-5797\" style=\"width: 493px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/melt_data_sex_and_time.png\" data-rel=\"lightbox-image-13\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/melt_data_sex_and_time.png\" alt=\"\" width=\"493\" height=\"357\" class=\"size-full wp-image-5797\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/melt_data_sex_and_time.png 493w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/melt_data_sex_and_time-300x217.png 300w\" sizes=\"auto, (max-width: 493px) 100vw, 493px\" \/><\/a><figcaption id=\"caption-attachment-5797\" 
class=\"wp-caption-text\">Fig 14 &#8211; Melting the patient_heart_rate dataset with row values in Sex and Time columns.<\/figcaption><\/figure><\/p>\n<p>Fig 15 shows the results of splitting the single column into two real variables: Time and Sex.<\/p>\n<figure id=\"attachment_5798\" aria-describedby=\"caption-attachment-5798\" style=\"width: 468px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/tidydata_after_melting.png\" data-rel=\"lightbox-image-14\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/tidydata_after_melting.png\" alt=\"\" width=\"468\" height=\"479\" class=\"size-full wp-image-5798\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/tidydata_after_melting.png 468w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/05\/tidydata_after_melting-293x300.png 293w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><\/a><figcaption id=\"caption-attachment-5798\" class=\"wp-caption-text\">Fig 15 &#8211; Tidy data after Splitting the single column into two variables: Time and Sex<\/figcaption><\/figure>\n<hr>\n<h3> Problems I have not covered<\/h3>\n<p>There were two important and common problems that I could not do justice to in this already long post:<br \/>\na) handling dates<br \/>\nb) correcting character encodings (a problem you hit when you scrape data off the web)<br \/>\nI want to cover them as part of a new post on data scraping. If you hit these problems before I publish that post, I&#8217;d encourage you to Google them.<\/p>\n<hr \/>\n<p>In this post, you have gotten the essence of tidying data using Python Pandas. Data cleaning is not a glamorous job but it is critical in data-intensive applications. I hope I have convinced you, the tester, that it is not so scary. 
Depending on your requirements and analysis, you can tidy your data with simple methods as shown in this post. <\/p>\n<p><strong>If you are a startup finding it hard to hire technical QA engineers, learn more <a href=\"https:\/\/qxf2.com\/blog\/about-qxf2\/\">about Qxf2 Services<\/a>.<\/strong><\/p>\n<hr>\n<h3>References:<\/h3>\n<p>I strongly recommend you spend some time on the references below. I learnt so much from them!<br \/>\n1. <a href=\"http:\/\/vita.had.co.nz\/papers\/tidy-data.pdf\">Tidy data<\/a>: Read this paper if you are serious about testing. The entire Qxf2 team <strong>highly, highly<\/strong> recommends it.<br \/>\n2. <a href=\"https:\/\/www.youtube.com\/watch?v=TaxJwC_MP9Q\" data-rel=\"lightbox-video-0\">Engineering data analysis<\/a>: This is an excellent talk on data and how it can be used. The talk is by the author of the tidy data paper.<br \/>\n3. <a href=\"http:\/\/www.jeannicholashould.com\/tidy-data-in-python.html\">Tidy data and Python<\/a>: The tidy data paper was written with R snippets. This post is for Pythonistas.<br \/>\n4. 
<a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_cleansing\">Data cleansing<\/a>: The wiki article has good thoughts on the attributes of high quality data.<\/p>\n<!-- 
-->\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>I am sharing some tips and tricks on cleaning data and restructuring the data you are using for testing. Why this post? Qxf2 works with many data intensive applications. I&#8217;ve noticed a pattern &#8211; my colleagues hit very similar data related problems across many different projects. That got me thinking critically about test data. I was thrilled to stumble upon [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[131,130,132,18],"tags":[133],"class_list":["post-5635","post","type-post","status-publish","format-standard","hentry","category-data-cleaning","category-machine-learning","category-pandas","category-python","tag-data-cleanup"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5635","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=5635"}],"version-history":[{"count":102,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5635\/revisions"}],"predecessor-version":[{"id":8411,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/5635\/revisions\/8411"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=5635"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=5635"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=5635"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}