{"id":6736,"date":"2017-10-09T07:34:29","date_gmt":"2017-10-09T11:34:29","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=6736"},"modified":"2018-04-02T10:21:30","modified_gmt":"2018-04-02T14:21:30","slug":"quilt-data-package-manager","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/quilt-data-package-manager\/","title":{"rendered":"Quilt &#8211; a Data Package Manager"},"content":{"rendered":"<p>We have been testing data-rich applications for a long time. And like any experienced tester, we realize how difficult it is to create, maintain and update data every time the data model changes. So we were excited to come across <a href=\"https:\/\/quiltdata.com\/\">Quilt<\/a>, a data package manager, via <a href=\"https:\/\/news.ycombinator.com\">Hacker News<\/a>. We were thrilled that it integrated well with our favorite programming language &#8211; Python. So we set about exploring Quilt to see if it could be &#8216;GitHub for data&#8217;. In this post, we will talk about the basic operations that we performed on data and how we could access older versions of the data using Quilt. We will write a follow-up soon with a step-by-step guide on how to integrate Quilt with your test data.<\/p>\n<p><strong>Note:<\/strong> Quilt talks about storing binary data and therefore not duplicating too much data. As testers, we don&#8217;t (yet) care too much about this cool feature. But we are sure that if you are a developer or data analyst, you will care about it.<\/p>\n<hr\/>\n<h3>The need for data versioning<\/h3>\n<p>We like creating our own data for testing data-heavy applications. This presents a few problems:<\/p>\n<p>1. It is hard to maintain test data when the underlying data model changes. Over time, in any data-rich application, the data model undergoes changes. For example, new fields are added, and existing fields get deleted, split or combined. All this means that the test data needs to evolve. 
We end up writing separate scripts to create data and then to maintain the data.<\/p>\n<p>2. It is very difficult to intelligently name files. As testers, our instinct is not to design and use databases for our data. Instead, we try to rely on flat files. So if you are like us, you end up with a few dozen (or hundred) files that you have tried to name intelligently. To us, this approach has always felt like the poor man&#8217;s version control.<\/p>\n<p>Data versioning can solve both these problems. In many cases, simply using git (or your version control tool) is not an option when you have large amounts of data. So we were glad to find version control that is specifically designed for data.<\/p>\n<p>What we especially liked about Quilt was that it integrates so well with Python. Quilt uses the pandas DataFrame as the default data structure. This makes reading and (especially) editing data so easy! If you have ever tried editing specific columns of a large CSV (yeah, we know about <a href=\"https:\/\/docs.python.org\/2\/library\/fileinput.html\">fileinput<\/a>) you will know why we are so happy.<\/p>\n<hr\/>\n<h3>Quilt<\/h3>\n<p>Quilt is a data package manager, where a data package is a versioned bundle of serialized data wrapped in a Python module. Phew! That was a mouthful! We are new to Quilt, so our understanding may be wrong. But here is our mental model for Quilt. Quilt takes your data and:<br \/>\na) converts it into a special format (serialized binary)<br \/>\nb) magically transforms your data into a Python module (wrapped in a Python module)<br \/>\nc) lets you commit\/store the data in a version control system (versioned)<\/p>\n<p>Once you run your dataset (or a portion of your dataset) through Quilt, you get what is called a &#8216;data package&#8217;. A data package is an abstraction that encapsulates and automates data preparation. By packaging the dataset, you can easily reuse it and manage the different versions. 
Quilt comes with a command-line utility that builds, pushes, and installs data packages. In this post, we cover the areas below and show the operations that we tried using Quilt.<\/p>\n<ul>\n<li>Use an existing Quilt package<\/li>\n<li>Create a new Quilt package<\/li>\n<li>Managing versions of data<\/li>\n<li>Edit package contents<\/li>\n<\/ul>\n<hr\/>\n<h3>1. Use an existing Quilt package<\/h3>\n<p>We can either install an already published package from the <a href=\"https:\/\/quiltdata.com\">Quilt<\/a> website, or create a new dataset based on the testing requirement and make it public. First, let us understand how to use an existing package.<\/p>\n<p><strong>Quilt Installation<\/strong> &#8211; Quilt can be installed with <em>pip<\/em> using the command:<\/p>\n<pre lang=\"Python\">pip install quilt<\/pre>\n<p><strong>Package list<\/strong> &#8211; Already published packages reside <a href=\"https:\/\/quiltdata.com\">here<\/a>.<\/p>\n<p><strong>Dataset Installation<\/strong> &#8211; Every Quilt command is available both on the command line and in Python. So installing and downloading a data package is very simple. For example, to download the iris data package published by the user uciml, we can use the command below:<\/p>\n<pre lang=\"Python\">import quilt\r\n\r\nquilt.install(\"uciml\/iris\")<\/pre>\n<p>Packages are installed in the current directory in a folder named <em>quilt_packages<\/em>. Now we can load the data directly into Python using the import command. You can import the package just like any other Python package. You can edit package contents by using the pandas API on the existing dataframes. 
The figure below shows how to install and import an existing Quilt package.<\/p>\n<figure id=\"attachment_6856\" aria-describedby=\"caption-attachment-6856\" style=\"width: 960px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/existing-pacakge.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/existing-pacakge.png\" alt=\"\" width=\"960\" height=\"398\" class=\"size-full wp-image-6856\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/existing-pacakge.png 960w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/existing-pacakge-300x124.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/existing-pacakge-768x318.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/a><figcaption id=\"caption-attachment-6856\" class=\"wp-caption-text\">Import the existing quilt packages<\/figcaption><\/figure>\n<p>Packages contain three types of nodes: <\/p>\n<ul>\n<li><em>PackageNode <\/em>&#8211; the root of the package tree<\/li>\n<li><em>GroupNode <\/em>&#8211; like a folder; may contain one or more GroupNode or DataNode objects<\/li>\n<li><em>DataNode <\/em>&#8211; a leaf node in the package; contains actual data<\/li>\n<\/ul>\n<hr\/>\n<h3>2. Create a new Quilt package<\/h3>\n<p>For this post, we have created a simple dataset. We will now see how to convert a data file into a data package using a configuration file, conventionally called build.yml. The build.yml file tells Quilt how to structure a package. 
quilt <em><strong>generate<\/strong><\/em> automatically creates a build file that mirrors the contents of any directory.<\/p>\n<pre lang=\"Python\">quilt generate sourcedata<\/pre>\n<pre lang=\"Python\">\r\ncontents:\r\n  README:\r\n    file: README.md\r\n  quilt_sample_package:\r\n    file: quilt_sample.csv\r\n<\/pre>\n<p>In build.yml, quilt_sample_package is the name that package users will type to access the data extracted from the CSV file. You can edit build.yml at any time to make the data name or package name easier to understand. Each Quilt package has a unique handle of the form USER_NAME\/PACKAGE_NAME. The package life cycle consists of core commands such as build, push, log and install. To use a data package, you import it.<\/p>\n<p>The command <strong>quilt <em>build<\/em><\/strong> creates a package. Quilt uses pandas to parse tabular file formats (xls, csv, tsv, etc.) into dataframes and <a href=\"https:\/\/arrow.apache.org\/docs\/python\/install.html\">pyarrow<\/a> to serialize dataframes to <a href=\"https:\/\/parquet.apache.org\/\">Parquet<\/a> format. 
Below is the command for using quilt build<\/p>\n<pre lang=\"Python\">quilt build indira\/quilt_sample_package sourcedata\/build.yml<\/pre>\n<figure id=\"attachment_6839\" aria-describedby=\"caption-attachment-6839\" style=\"width: 967px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/buildpic.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/buildpic.png\" alt=\"\" width=\"967\" height=\"129\" class=\"size-full wp-image-6839\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/buildpic.png 967w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/buildpic-300x40.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/buildpic-768x102.png 768w\" sizes=\"auto, (max-width: 967px) 100vw, 967px\" \/><\/a><figcaption id=\"caption-attachment-6839\" class=\"wp-caption-text\">Build a quilt package<\/figcaption><\/figure>\n<p>The command <strong>quilt <em>push<\/em><\/strong> stores a package in a server-side registry for anyone who needs it. 
You need to be a registered user of Quilt (free tier available) to be able to push and store the package.<\/p>\n<pre lang=\"Python\">\r\nquilt login\r\n\r\nquilt push --public indira\/quilt_sample_package\r\n<\/pre>\n<figure id=\"attachment_6840\" aria-describedby=\"caption-attachment-6840\" style=\"width: 961px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/push.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/push.png\" alt=\"\" width=\"961\" height=\"156\" class=\"size-full wp-image-6840\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/push.png 961w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/push-300x49.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/08\/push-768x125.png 768w\" sizes=\"auto, (max-width: 961px) 100vw, 961px\" \/><\/a><figcaption id=\"caption-attachment-6840\" class=\"wp-caption-text\">Push to Quilt Registry<\/figcaption><\/figure>\n<p>The package now resides in the registry and has a landing page populated by sourcedata\/README.md. Here is the <a href=\"https:\/\/quiltdata.com\/package\/indira\/quilt_sample_package\">link<\/a> to the landing page that is created for this package. You can omit the --public flag to create private packages.<\/p>\n<hr\/>\n<h3>3. Managing versions of data<\/h3>\n<p>The command <em>quilt <strong>log<\/strong><\/em> shows how a package has changed over time. Whenever a user changes the data for a particular requirement, the changes are tracked in the log history as shown below. The build and push commands need to be run again after any data change. 
<\/p>\n<figure id=\"attachment_6850\" aria-describedby=\"caption-attachment-6850\" style=\"width: 750px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/log2.png\" data-rel=\"lightbox-image-3\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/log2.png\" alt=\"\" width=\"750\" height=\"93\" class=\"size-full wp-image-6850\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/log2.png 750w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/log2-300x37.png 300w\" sizes=\"auto, (max-width: 750px) 100vw, 750px\" \/><\/a><figcaption id=\"caption-attachment-6850\" class=\"wp-caption-text\">Quilt Log History for tracking changes<\/figcaption><\/figure>\n<p>Finally, the command <em>quilt <strong>install<\/strong> -x<\/em> allows us to install historical snapshots. In our case, we wanted an older version of the data, so we ran quilt install with the package handle and the hash of the older version (taken from the log history), as shown below:<\/p>\n<pre lang=\"python\">quilt install indira\/quilt_sample_package -x OLD_HASH<\/pre>\n<figure id=\"attachment_6851\" aria-describedby=\"caption-attachment-6851\" style=\"width: 962px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/install.png\" data-rel=\"lightbox-image-4\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/install.png\" alt=\"\" width=\"962\" height=\"136\" class=\"size-full wp-image-6851\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/install.png 962w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/install-300x42.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/install-768x109.png 768w\" sizes=\"auto, (max-width: 962px) 100vw, 962px\" \/><\/a><figcaption id=\"caption-attachment-6851\" 
class=\"wp-caption-text\">Quilt install to retrieve the older version of data<\/figcaption><\/figure>\n<p>We observed a couple of things here. When we ran quilt install with the old hash, we saw a message saying &#8216;Fragment already installed; skipping.&#8217; At first it looked to us as if the changes made to the data were not reflected. This &#8220;skipping&#8221; message only means that one or more data fragments haven&#8217;t changed. The version will still be correct and still gets installed. Currently, Quilt doesn&#8217;t delete any no-longer-used data. So, when we build the first version of the .csv, Quilt generates a package. When we then build the second version, the data from the first version is still around (think of it as a cache). That is why, when we &#8220;quilt <em>install<\/em>&#8221; a different version, it says that we already have the data. The changes made to the data are seen only when we actually import the package. If you see different fragment hashes when you install different versions, it confirms that the data is different. When you run &#8220;quilt <em>install<\/em>&#8221;, you should see something like this:<\/p>\n<p><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/data-fragment.png\" data-rel=\"lightbox-image-5\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/data-fragment.png\" alt=\"\" width=\"711\" height=\"71\" class=\"aligncenter size-full wp-image-6877\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/data-fragment.png 711w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2017\/09\/data-fragment-300x30.png 300w\" sizes=\"auto, (max-width: 711px) 100vw, 711px\" \/><\/a><\/p>\n<p>When we try to install a different version, we should see a different hash in the \u201cDownloading\u201d message. (If you don\u2019t, then something is in fact wrong.)<\/p>\n<hr\/>\n<h3>4. 
Edit package contents<\/h3>\n<p>Data packages are like folders containing dataframes. So we can use the pandas API to edit the dataframes. With Python\u2019s dot operator you can traverse a data package as shown below.<\/p>\n<pre lang='python'>\r\nfrom quilt.data.indira import quilt_sample_package\r\n\r\n# Load the package data as a pandas DataFrame\r\ndf = quilt_sample_package._data()\r\n\r\nprint(df)\r\n<\/pre>\n<p>The <em>_data()<\/em> method caches the dataframe, so it will return the same object each time. We can use the pandas <em>at<\/em> accessor to set individual values in the data, as shown below. (We originally used <em>set_value<\/em> here, but that method has since been removed from pandas.)<\/p>\n<pre lang='python'>\r\n# Set the 'radius' value in the row labelled 0\r\ndf.at[0, 'radius'] = 15<\/pre>\n<p>In the above code, we modified a value in the &#8216;radius&#8217; column. Similarly, we can use the <em>_set<\/em> method on the top-level package node to create new groups and data nodes as shown below.<\/p>\n<pre lang='python'>\r\nimport pandas as pd\r\n\r\n# Add a new dataframe\r\ndf = pd.DataFrame(dict(a=[1, 2, 3]))\r\nquilt_sample_package._set(['test', 'df'], df)\r\n\r\n# Add a new group\r\nquilt_sample_package._add_group(\"testgroup\")\r\nquilt_sample_package._set(['test', 'testgroup', 'df'], df)\r\n<\/pre>\n<p>Once the changes are made to the data, the package owner needs to build and push to update the package, and can then verify the new contents.<\/p>\n<pre lang='python'>\r\nimport quilt\r\n\r\n# Rebuild the package from the edited in-memory object, then push it to the registry\r\nquilt.build(\"indira\/quilt_sample_package\", quilt_sample_package)\r\n\r\nquilt.push(\"indira\/quilt_sample_package\")<\/pre>\n<hr\/>\n<p>We will follow this post up with one more post containing a step-by-step guide and a hands-on example of how to use Quilt. Stay tuned!<\/p>\n<p><strong>If you are a startup finding it hard to hire technical QA engineers, learn more <a href=\"https:\/\/qxf2.com\/blog\/about-qxf2\/\">about Qxf2 Services<\/a>.<\/strong><\/p>\n<hr\/>\n<h3>References<\/h3>\n<p>1. <a href=\"https:\/\/docs.quiltdata.com\/python.html\">Introduction to Quilt<\/a><br \/>\n2. 
<a href=\"https:\/\/blog.quiltdata.com\/data-packages-for-fast-reproducible-python-analysis-c74b78015c7f\">Data-packages-for-fast-reproducible-python-analysis<\/a><br \/>\n3. <a href=\"https:\/\/blog.quiltdata.com\/its-time-to-manage-data-like-source-code-3df04cd312b8\">Manage-data-like-source-code<\/a><\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>We have been testing data-rich applications for a long time. And like any experienced tester, we realize how difficult it is to create, maintain and update data every time the data model changes. So we were excited to come across Quilt, a data package manager, via Hacker News. We were thrilled that it integrated well with our favorite programming language [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[148,141,130,18,147],"tags":[],"class_list":["post-6736","post","type-post","status-publish","format-standard","hentry","category-data-versioning","category-extracting-data","category-machine-learning","category-python","category-quilt"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/6736","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=6736"}],"version-history":[{"count":69,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/6736\/revisions"}],"predecessor-version":[{"id":7135,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/6736\/revisions\/7135"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=6736"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"h
ttps:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=6736"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=6736"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}