{"id":19490,"date":"2023-09-05T08:22:40","date_gmt":"2023-09-05T12:22:40","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=19490"},"modified":"2023-09-05T08:22:40","modified_gmt":"2023-09-05T12:22:40","slug":"dataset-and-model-evaluation-using-deepchecks","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/dataset-and-model-evaluation-using-deepchecks\/","title":{"rendered":"Dataset and Model Evaluation using Deepchecks"},"content":{"rendered":"<p>At <a href=\"https:\/\/qxf2.com\/?utm_source=robustnesstesting&#038;utm_medium=click&#038;utm_campaign=From%20blog\">Qxf2<\/a>, we have always been curious on how to test and validate the datasets and models. A good machine learning team should continuously monitor the model to identify any changes in model performance. You need to be confident that your models are accurate, reliable, and fair. <a href=\"https:\/\/docs.deepchecks.com\/stable\/getting-started\/welcome.html\">Deepchecks<\/a> can help you achieve this by providing a comprehensive set of tools for validating your models and datasets. Deepchecks is a Python open-source tool used for testing and validation of machine learning models. This post outlines few steps required for thorough testing and validation of both machine learning models and datasets using Deepchecks.<\/p>\n<h3>Setup and Installation<\/h3>\n<p>Installing Deepchecks For NLP is pretty straight forward. <\/p>\n<pre lang=\"python\">\r\npip install -U deepchecks[nlp]<\/pre>\n<h3>About our Dataset and Model<\/h3>\n<p>Our dataset <em>PTO_messages.csv<\/em> is a collection of text messages for Paid Time Off (PTO) and non-PTO. The labels denote binary classification, with 0 representing not PTO and 1 representing PTO-related messages. 
You can find our dataset <a href=\"https:\/\/gist.github.com\/indiranell\/53def6721b4b6b139a8a5a3cab96d114\">here<\/a>.<\/p>\n<pre lang=\"python\">\r\ndata.head()\r\n<\/pre>\n<figure id=\"attachment_20017\" aria-describedby=\"caption-attachment-20017\" style=\"width: 492px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/df_head-2.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/df_head-2.png\" alt=\"PTO detector dataset\" width=\"492\" height=\"235\" class=\"size-full wp-image-20017\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/df_head-2.png 492w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/df_head-2-300x143.png 300w\" sizes=\"auto, (max-width: 492px) 100vw, 492px\" \/><\/a><figcaption id=\"caption-attachment-20017\" class=\"wp-caption-text\">PTO detector dataset<\/figcaption><\/figure>\n<p>For the purpose of this blog post, I have made a few modifications to our Python script <em>PTO_detector.py<\/em> to train a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html\">RandomForestClassifier<\/a> model and generate outputs for the various checks.<\/p>\n<p>Note: The script <em>PTO_detector.py<\/em> originally used a <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.svm.LinearSVC.html\">LinearSVC<\/a> model to classify messages as PTO or non-PTO.<\/p>\n<h3>Deepchecks Suites<\/h3>\n<p>Deepchecks is bundled with pre-built suites that run a set of checks on your data. There are three suites: the data integrity suite, the train-test validation suite, and the model evaluation suite. You can also run all the checks at once using the full_suite. 
These suites help you make sure your data and models are reliable and accurate.<\/p>\n<h4>Data Integrity<\/h4>\n<p>The data integrity suite examines the correctness of text formatting. Its purpose is to verify that the dataset is both accurate and complete.<\/p>\n<h4>Train Test Validation<\/h4>\n<p>The train-test validation suite is used to compare two datasets: the training dataset and the testing dataset. It ensures that the division between these two sets of data is sound.<\/p>\n<h4>Model Evaluation<\/h4>\n<p>The model evaluation suite runs after training the model and requires model predictions. It evaluates the performance of the model.<\/p>\n<h3>Deepchecks data types<\/h3>\n<p>Deepchecks supports the different data types commonly used in Machine Learning. <\/p>\n<ul>\n<li>Tabular: Handles data stored in tabular form, like a pandas DataFrame<\/li>\n<li>NLP: Handles textual data<\/li>\n<li>Vision: Handles image datasets<\/li>\n<\/ul>\n<p>In this post, I will be testing our text classification model, focusing on the NLP checks and suites that are relevant for this particular scenario. Please note that our dataset is designed for binary classification, with &#8220;Text&#8221; and &#8220;Labels&#8221; columns. <\/p>\n<h3>Implementing Deepchecks NLP<\/h3>\n<p>Deepchecks NLP offers various data checks, such as data integrity and drift checks, which work on any NLP task.<br \/>\nTo execute Deepchecks for NLP, you&#8217;ll need to create a <a href=\"https:\/\/docs.deepchecks.com\/0.17\/nlp\/usage_guides\/text_data_object.html\">TextData<\/a> object for both your training and testing data. 
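The TextData objects below assume the dataset has already been split into X_train\/X_test with matching y_train\/y_test labels. How you split is up to you (scikit-learn's train_test_split is the usual choice); purely as an illustration of the idea, a stdlib-only sketch of an 80\/20 split might look like:

```python
import random

def simple_split(texts, labels, test_ratio=0.2, seed=42):
    """Shuffle indices, then carve off the last test_ratio share as the test set."""
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train_idx, test_idx = idx[:cut], idx[cut:]
    X_train = [texts[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

The function name and the 80\/20 ratio are assumptions for illustration only, not part of the Deepchecks API.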
The TextData object serves as a container for your textual data, associated labels, and relevant metadata for NLP tasks.<\/p>\n<p>First, import TextData from Deepchecks:<\/p>\n<pre lang=\"python\">\r\nfrom deepchecks.nlp import TextData<\/pre>\n<p>Then create a TextData object for the train and test data as shown below:<\/p>\n<pre lang=\"python\">\r\ntrain = TextData(X_train, label=y_train, task_type='text_classification')\r\ntest = TextData(X_test, label=y_test, task_type='text_classification')<\/pre>\n<p>The arguments required by TextData are the text data, label, task_type, and metadata (optional). In the above code, the train object is created using the training data (X_train) and its corresponding labels (y_train), and the test object is created using the testing data (X_test) and its corresponding labels (y_test), both with the specified task type.<\/p>\n<h3>Data Integrity Checks<\/h3>\n<p>Now you can run the data integrity suite on these TextData objects:<\/p>\n<pre lang=\"python\">\r\nfrom deepchecks.nlp.suites import data_integrity\r\ndata_integrity_suite = data_integrity()\r\ndata_integrity_suite.run(train, test)\r\n<\/pre>\n<p>Deepchecks conducts the checks and outputs pass or fail based on default conditions. Let&#8217;s dive deeper into the summary for more understanding. The summary displays a few passed and failed results, as shown below. Our focus will be on analysing the failed conditions and figuring out the issues. 
<strong>Note<\/strong>: For the purposes of this blog, data cleaning has been disabled in order to showcase the presence of unknown tokens in the output.<\/p>\n<figure id=\"attachment_19899\" aria-describedby=\"caption-attachment-19899\" style=\"width: 900px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity-1024x452.png\" alt=\"Data Integrity checks\" width=\"900\" height=\"397\" class=\"size-large wp-image-19899\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity-1024x452.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity-300x132.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity-768x339.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity.png 1206w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-19899\" class=\"wp-caption-text\">Data Integrity checks<\/figcaption><\/figure>\n<h5>Unknown Tokens<\/h5>\n<p>Unknown tokens are words, special characters, or emojis in the text that are not recognized by the model&#8217;s tokenizer. The first check (marked with the x in Didn&#8217;t Pass) tells us that the dataset contains unknown tokens. Since the actual ratio is 0.03% and the check expects it to be 0%, the check has failed. This low ratio indicates that most of the text in the training dataset is already recognized and processed effectively by the tokenizer. 
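Conceptually, the reported ratio is simply the share of tokens that fall outside the tokenizer's vocabulary. A rough stdlib illustration (the whitespace tokenization and the toy vocabulary are assumptions for this sketch, not how Deepchecks tokenizes internally):

```python
def unknown_token_ratio(samples, vocab):
    """Fraction of whitespace-separated tokens not present in the known vocabulary."""
    tokens = [tok for text in samples for tok in text.lower().split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return unknown / len(tokens)
```

A ratio near zero, as in our run, means the vocabulary covers almost all of the text.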
If the ratio is high, you can pre-process the data or update your tokenizer to minimize the occurrence of unknown tokens.<\/p>\n<figure id=\"attachment_19528\" aria-describedby=\"caption-attachment-19528\" style=\"width: 900px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1-1024x624.png\" alt=\"Unknown Tokens check\" width=\"900\" height=\"548\" class=\"size-large wp-image-19528\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1-1024x624.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1-300x183.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1-768x468.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/unknown-tokens-1.png 1157w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-19528\" class=\"wp-caption-text\">Unknown Tokens check<\/figcaption><\/figure>\n<h5>Conflicting Labels &#8211; Train Dataset and Test Dataset<\/h5>\n<p>The second check identifies identical or nearly identical samples in the dataset that have different labels. This failure indicates the presence of conflicting labels in the dataset. This insight is valuable, as it points out inconsistencies in labelling that must be fixed to ensure accurate model training and evaluation. The check has not passed because the percentage of samples with inconsistent labels is 5.06% in the train dataset and 2.56% in the test dataset. This exceeds the defined threshold of 0%, indicating that there are cases where the same sample has different labels. 
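What this check flags can be reproduced with a simple grouping: identical texts mapped to more than one label. A minimal stdlib sketch of the idea (Deepchecks also catches near-duplicates, which this exact-match version ignores):

```python
from collections import defaultdict

def conflicting_samples(texts, labels):
    """Return normalized texts that appear with more than one distinct label."""
    seen = defaultdict(set)
    for text, label in zip(texts, labels):
        seen[text.strip().lower()].add(label)
    return {text for text, found in seen.items() if len(found) > 1}
```

Running something like this on your raw data is a quick way to locate the offending rows before relabelling them.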
<\/p>\n<figure id=\"attachment_19901\" aria-describedby=\"caption-attachment-19901\" style=\"width: 900px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels.png\" data-rel=\"lightbox-image-3\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels-1024x565.png\" alt=\"Data Integrity checks - Conflicting labels\" width=\"900\" height=\"497\" class=\"size-large wp-image-19901\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels-1024x565.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels-300x166.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels-768x424.png 768w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_conflicting_labels.png 1064w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><figcaption id=\"caption-attachment-19901\" class=\"wp-caption-text\">Data Integrity checks &#8211; Conflicting Train labels<\/figcaption><\/figure>\n<h5>Text Duplicates &#8211; Train Dataset and Test Dataset<\/h5>\n<p>This check examines the presence of duplicate samples in both the Train and Test sets. Our output shows that the Train data contains <em>24.95% duplicate data<\/em>, while the Test data contains <em>7.69% duplicate data<\/em>. 
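The duplicate percentage itself is straightforward to compute: the share of samples whose text has already appeared earlier in the set. A stdlib sketch (exact matching after lowercasing and trimming; an assumption here, since Deepchecks applies its own text normalization):

```python
def duplicate_ratio(texts):
    """Fraction of samples that are repeats of an earlier, normalized sample."""
    seen, duplicates = set(), 0
    for text in texts:
        key = text.strip().lower()
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(texts) if texts else 0.0
```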
These percentages exceed the threshold of 5%, signalling potential data redundancy that needs to be fixed.<\/p>\n<figure id=\"attachment_19924\" aria-describedby=\"caption-attachment-19924\" style=\"width: 936px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_duplicates_train-2.png\" data-rel=\"lightbox-image-4\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_duplicates_train-2.png\" alt=\"Data Integrity checks - Text Duplicates\" width=\"936\" height=\"545\" class=\"size-full wp-image-19924\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_duplicates_train-2.png 936w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_duplicates_train-2-300x175.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/data_integrity_duplicates_train-2-768x447.png 768w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><\/a><figcaption id=\"caption-attachment-19924\" class=\"wp-caption-text\">Data Integrity checks &#8211; Text Duplicates<\/figcaption><\/figure>\n<p>Similarly, you can conduct checks such as Text Property Outliers, Property Label Correlation, Special Characters, and Under Annotated Metadata Segments. Please refer to the <a href=\"https:\/\/docs.deepchecks.com\/stable\/nlp\/index.html\">Deepchecks NLP<\/a> documentation for more details. By rectifying these issues, you enhance the quality of the data.<\/p>\n<h3>Train Test Evaluation<\/h3>\n<p>Once you are confident that your data is good to train on, the next step is to validate the split and compare the train and test datasets.<\/p>\n<pre lang=\"python\">\r\nfrom deepchecks.nlp.suites import train_test_validation\r\ntrain_test_validation().run(train, test)<\/pre>\n<h5>Train Test Samples Mix<\/h5>\n<p>The output shows one failed and one passed condition in our case. 
This check tells us the percentage of test data samples that also appear in the training data. The goal here is to avoid overlap between the two sets. In our case, about 29.49% of the test data samples also appear in our training data. The desired percentage is lower than 5%, and we are above that threshold. This information is very valuable for fine-tuning our dataset and the model&#8217;s evaluation. <\/p>\n<figure id=\"attachment_19908\" aria-describedby=\"caption-attachment-19908\" style=\"width: 961px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train_test_evaluation.png\" data-rel=\"lightbox-image-5\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train_test_evaluation.png\" alt=\"Train Test Evaluation\" width=\"961\" height=\"745\" class=\"size-full wp-image-19908\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train_test_evaluation.png 961w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train_test_evaluation-300x233.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train_test_evaluation-768x595.png 768w\" sizes=\"auto, (max-width: 961px) 100vw, 961px\" \/><\/a><figcaption id=\"caption-attachment-19908\" class=\"wp-caption-text\">Train Test Evaluation<\/figcaption><\/figure>\n<p>Upon closer examination, specific test samples were found to have corresponding instances within the training set (46 out of 156). To address this, you can apply different split techniques suitable for your requirements to reduce the data overlap. <\/p>\n<h5>Label Drift<\/h5>\n<p>We also have a passed status for the &#8220;Label Drift&#8221; check. The Label Drift check acts as a measure to gauge the dissimilarity between the distributions of labels in the train and test datasets. 
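One simple way to picture such a dissimilarity score is the total variation distance between the two empirical label distributions. This is only a sketch of the intuition; Deepchecks uses its own drift measures for categorical labels (such as Cramer's V), so the numbers will differ:

```python
from collections import Counter

def total_variation(labels_a, labels_b):
    """Total variation distance between two empirical label distributions (0 = identical, 1 = disjoint)."""
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    classes = set(dist_a) | set(dist_b)
    return 0.5 * sum(
        abs(dist_a[c] / len(labels_a) - dist_b[c] / len(labels_b))
        for c in classes
    )
```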
It calculates the difference in label distributions between these datasets. The &#8220;Passed&#8221; status signifies that the label drift is within an acceptable range, with a minimal score of 0.15.<\/p>\n<figure id=\"attachment_19545\" aria-describedby=\"caption-attachment-19545\" style=\"width: 917px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train-test-suite-label-drift.png\" data-rel=\"lightbox-image-6\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train-test-suite-label-drift.png\" alt=\"Train Test Evaluation - Label Drift\" width=\"917\" height=\"780\" class=\"size-full wp-image-19545\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train-test-suite-label-drift.png 917w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train-test-suite-label-drift-300x255.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/train-test-suite-label-drift-768x653.png 768w\" sizes=\"auto, (max-width: 917px) 100vw, 917px\" \/><\/a><figcaption id=\"caption-attachment-19545\" class=\"wp-caption-text\">Train Test Evaluation &#8211; Label Drift<\/figcaption><\/figure>\n<p>This check also provides insights into the distributions, showing the top 10 categories with the most significant differences between the two datasets. Similarly, you can conduct the Embedding Drift and NLP Property Drift checks. 
Please refer to the <a href=\"https:\/\/docs.deepchecks.com\/stable\/nlp\/index.html\">Deepchecks NLP<\/a> documentation for more details.<\/p>\n<h3>Model Evaluation<\/h3>\n<p>The model evaluation suite is designed to be run after a model has been trained. It requires model predictions and probabilities, which can be supplied via arguments to the run function.<\/p>\n<pre lang=\"python\">\r\nfrom deepchecks.nlp.suites import model_evaluation\r\nmodel_evaluation().run(train, test,\r\n                       train_predictions=train_preds,\r\n                       test_predictions=test_preds,\r\n                       train_probabilities=train_probs,\r\n                       test_probabilities=test_probs)\r\n<\/pre>\n<h5>Train Test Performance<\/h5>\n<p>The condition &#8220;Train Test Performance&#8221; failed in the output. The train test degradation score is a measure of how much a model&#8217;s performance drops when moving from the training dataset to the testing dataset. 
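The degradation itself is just the relative drop of a metric between the two sets. A sketch of the idea (the exact formula and rounding Deepchecks applies may differ):

```python
def relative_degradation(train_score, test_score):
    """Relative drop in a metric when moving from train data to test data."""
    if train_score == 0:
        return 0.0
    return (train_score - test_score) / train_score
```

A value of 0.12 would correspond to roughly a 12% drop of that metric on unseen data.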
It&#8217;s an important metric to evaluate how well a model performs on unseen data.<\/p>\n<figure id=\"attachment_19910\" aria-describedby=\"caption-attachment-19910\" style=\"width: 906px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation.png\" data-rel=\"lightbox-image-7\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation.png\" alt=\"Model Evaluation - Train Test Performance\" width=\"906\" height=\"779\" class=\"size-full wp-image-19910\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation.png 906w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation-300x258.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation-768x660.png 768w\" sizes=\"auto, (max-width: 906px) 100vw, 906px\" \/><\/a><figcaption id=\"caption-attachment-19910\" class=\"wp-caption-text\">Model Evaluation &#8211; Train Test Performance<\/figcaption><\/figure>\n<p>The output also highlights that there are two scores that failed this degradation test. The most significant degradation observed is for the &#8220;Recall&#8221; metric on class 1, with a maximum degradation of 12.28%. This signifies a notable drop in the model&#8217;s ability to correctly identify instances belonging to class 1 when applied to new or unseen data. It could imply that the model is not maintaining consistent performance levels across training and testing data.<\/p>\n<h5>Prediction Drift<\/h5>\n<p>The condition for &#8220;Prediction Drift&#8221; passed in the output. It indicates the outcome of monitoring the model for prediction drift using the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kolmogorov%E2%80%93Smirnov_test\">Kolmogorov-Smirnov (KS)<\/a> drift score. The KS drift score reported is 0.14. 
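For intuition, the two-sample KS statistic is the largest vertical gap between the empirical CDFs of the two samples of predictions. A minimal stdlib sketch (in practice you would reach for scipy.stats.ks_2samp; this brute-force version is for illustration):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    gap = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / n_a
        cdf_b = sum(1 for v in sample_b if v <= x) / n_b
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

Identical distributions give a score of 0, fully disjoint ones give 1, so 0.14 sits toward the low-drift end.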
Even though the KS drift score of 0.14 is slightly below the specified threshold of 0.15, it is important to analyze the source of this drift. Compare the distribution of predictions at various time points to pinpoint when the drift started occurring. Consider retraining or fine-tuning the model to ensure that it remains effective and reliable in its predictions over time.<br \/>\n<figure id=\"attachment_19911\" aria-describedby=\"caption-attachment-19911\" style=\"width: 925px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation_prediction_drift.png\" data-rel=\"lightbox-image-8\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation_prediction_drift.png\" alt=\"Model Evaluation \u2013 Prediction Drift\" width=\"925\" height=\"677\" class=\"size-full wp-image-19911\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation_prediction_drift.png 925w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation_prediction_drift-300x220.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/08\/model_evaluation_prediction_drift-768x562.png 768w\" sizes=\"auto, (max-width: 925px) 100vw, 925px\" \/><\/a><figcaption id=\"caption-attachment-19911\" class=\"wp-caption-text\">Model Evaluation \u2013 Prediction Drift<\/figcaption><\/figure><\/p>\n<h3>Full Suite<\/h3>\n<p>Deepchecks gives a quick overview of your model and data with the Full Suite, which includes many of the implemented checks. 
full_suite is a collection of the pre-built checks.<br \/>\nTo run the full suite, use the code below:<\/p>\n<pre lang=\"python\">\r\nfrom deepchecks.nlp.suites import full_suite\r\nsuite = full_suite()\r\nsuite.run(train_dataset=train,\r\n    test_dataset=test,\r\n    with_display=True,\r\n    train_predictions=train_preds,\r\n    test_predictions=test_preds,\r\n    train_probabilities=train_probs,\r\n    test_probabilities=test_probs)\r\n<\/pre>\n<p>This will perform the full set of checks on the dataset objects and the model, and generate a consolidated output with Passed\/Didn&#8217;t Pass results.<\/p>\n<p>You can find the final code <a href=\"https:\/\/gist.github.com\/indiranell\/dbd94c4bfa6fa50b83199914c6996660\">here<\/a>. <\/p>\n<p>Using Deepchecks helps us check and fine-tune our models and data. It&#8217;s user-friendly and offers different tests to catch issues early, making our models work better. Happy Testing!!<\/p>\n<hr\/>\n<h4>References<\/h4>\n<p>1. <a href=\"https:\/\/docs.deepchecks.com\/0.17\/nlp\/index.html\">Deepchecks Documentation<\/a><br \/>\n2. <a href=\"https:\/\/medium.com\/@noamzbr\/deepchecks-nlp-ml-validation-for-text-made-easy-40aaa8a95c15\">Deepchecks NLP: ML Validation for Text Made Easy<\/a><br \/>\n3. <a href=\"https:\/\/domino.ai\/blog\/high-standard-ml-validation-with-deepchecks\">High-standard ML validation with Deepchecks<\/a><\/p>\n<hr>\n<h4>Hire technical testers from Qxf2<\/h4>\n<p>Qxf2 offers valuable expertise in testing Machine Learning projects. Our team goes beyond traditional QA to ensure data quality, model accuracy, and system robustness. With us, you&#8217;ll have skilled testers who understand the nuances of ML and can enhance the reliability of your projects. 
<a href=\"https:\/\/qxf2.com\/contact?utm_source=dataset_and_model_evaluation_using_deepchecks&#038;utm_medium=click&#038;utm_campaign=From%20blog\">Get in touch<\/a> with us!<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>At Qxf2, we have always been curious on how to test and validate the datasets and models. A good machine learning team should continuously monitor the model to identify any changes in model performance. You need to be confident that your models are accurate, reliable, and fair. Deepchecks can help you achieve this by providing a comprehensive set of tools [&hellip;]<\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[379,387,385,386],"tags":[],"class_list":["post-19490","post","type-post","status-publish","format-standard","hentry","category-deepchecks","category-model-evaluation-deepchecks","category-nlp","category-train-test-split"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19490","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=19490"}],"version-history":[{"count":143,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19490\/revisions"}],"predecessor-version":[{"id":20040,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/19490\/revisions\/20040"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=19490"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=19490"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/b
log\/wp-json\/wp\/v2\/tags?post=19490"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}