Dataset and Model Evaluation using Deepchecks

At Qxf2, we have always been curious on how to test and validate the datasets and models. A good machine learning team should continuously monitor the model to identify any changes in model performance. You need to be confident that your models are accurate, reliable, and fair. Deepchecks can help you achieve this by providing a comprehensive set of tools for validating your models and datasets. Deepchecks is a Python open-source tool used for testing and validation of machine learning models. This post outlines few steps required for thorough testing and validation of both machine learning models and datasets using Deepchecks.

Setup and Installation

Installing Deepchecks For NLP is pretty straight forward.

pip install -U deepchecks[nlp]

About our Dataset and Model

Our dataset PTO_messages.csv is a collection of text messages for Paid Time Off (PTO) and non-PTO. The labels denote binary classification, with 0 representing not PTO and 1 representing PTO-related messages. You can find our dataset here

PTO detector dataset
PTO detector dataset

For the purpose of this blog post, I have done few modifications to our Python script to train on RandomForestClassifier model to generate outputs for various checks.

Note: This script was originally trained on LinearSVC model to classify the PTO or non-PTO.

Deepchecks Suites

Deepchecks is bundled with a set of pre-built suites that can be used to run a set of checks on your data. There are 3 suites, the data integrity suite, the train test validation suite and the model evaluation suite. You can also run all the checks at once using the full_suite. These check sets help you make sure your data and models are reliable and accurate.

Data Integrity

This is a check for data integrity that examines the correctness of text formatting. Its purpose is to verify if the dataset is both accurate and complete.

Train Test Validation

The train-test validation suite is used to compare two datasets: the training dataset and the testing dataset. It ensures that the division between these two sets of data is accurate.

Model Evaluation

The model evaluation suite runs after a training the model and requires model predictions. This check is to evaluate the performance of a model.

Deepchecks data types

Deepcheck supports different data types commonly used in Machine Learning.

  • Tabular: Handles data stored in tabular like the pandas DataFrame
  • NLP: Handles Textual data
  • Vision: Handles image datasets
  • In this post, I will be testing our Text classification model, focusing on NLP checks and suites that are relevant for this particular scenario. Please note that our dataset is designed for binary classification data with “Text” and “Labels”.

    Implementing Deepchecks NLP

    Deepchecks NLP offers various data checks, such as data integrity and drift checks, which can work on any NLP task.
    To execute deepchecks for NLP, you’ll need to create a TextData Object for both your training and testing data. The TextData Object serves as a container for your textual data, associated labels, and relevant metadata for NLP tasks.

    Firstly, import the TextData from Deepchecks

    from deepchecks.nlp import TextData

    You’ll need to create a TextData object for the train and test data as shown below

    train = TextData(X_train, label=y_train, task_type='text_classification')
    test = TextData(X_test, label=y_test, task_type='text_classification')

    The arguments required by the TextData is the train data, label, task_type and metadata(Optional). In the above code, the train object is created using training data (X_train) and their corresponding labels (y_train), the test object is created using testing data (X_test) and their corresponding labels (y_test) with the specified task type.

    Data Integrity Checks

    Next, now you can run integrity suite on this TextData object:

    from deepchecks.nlp.suites import data_integrity
    data_integrity_suite = data_integrity(), test)

    Deepchecks conducts data checks and outputs pass or fail based on default conditions. Now, let’s deep dive into the summary for more understanding. The summary displays few passed and failed results as shown below. Our focus will be on analysing the failed conditions and figure out the issues. Note: For this blog purpose, the process of cleaning data has been disabled in order to showcase the presence of unknown tokens within the output.

    Data Integrity checks
    Data Integrity checks
    Unknown Tokens

    Unknown tokens are words, special characters, or emojis in the text that are not recognized by model’s tokenizer. The initial check (marked with the x in Didn’t Pass) tells us that the samples contains unexpected tokens in the dataset. Since the actual ratio is 0.03% and the check expects it to be 0%, the check has failed. This low ratio indicates that most of the text in the training dataset is already recognized and processed effectively by the tokenizer. In the instances of high ratio, you can simply pre-process the data or update your tokenizer to minimize the occurrence of unknown tokens.

    Unknown Tokens check
    Unknown Tokens check
    Conflicting Labels – Train Dataset and Text Dataset

    Train Dataset and Text Dataset – The second check identifies identical or nearly identical samples in the dataset that have different labels. This failed case indicates the presence of conflicting labels in the dataset. This insight is valuable as it points out inconsistencies in labelling to ensure accurate model training and evaluation. The check has not passed because the percentage of samples with inconsistent labels in train dataset is 5.06%, while the Test data contains 2.56%. This exceeds the defined threshold of 0%, indicating that there are cases where the same sample possesses different labels.

    Data Integrity checks - Conflicting labels
    Data Integrity checks – Conflicting Train labels
    Text Duplicates – Train Dataset and Text Dataset

    This check examines the presence of duplicate sample in both Train and Test sets. Our output shows that the Train data contains 24.95% duplicate data, while the Test data contains 7.69% duplicate data. These percentages exceed the threshold of 5%, alarming potential data redundancy that needs to be fixed.

    Data Integrity checks - Text Duplicates
    Data Integrity checks – Text Duplicates

    Similarly you can conduct checks for Text Property Outliers, Property Label Correlation, Special Characters, Under Annotated Metadata Segments etc checks. Please refer Deepchecks NLP documentation for more details. By rectifying these issues, you enhance the quality of the data.

    Train Test Evaluation

    Once you are confident that your data is all good to be trained, next step is to validate the split and compare train and test dataset.

    from deepchecks.nlp.suites import train_test_validation
    train_test_validation().run(train, test)
    Train Test Samples Mix

    The output shows one failed and one passed condition in our case. This check tells us about the percentage of test data samples that also appear in the training data. The goal here is to maintain a balance and avoiding an overlap between the two sets. In our case, about 29.49% of the test data samples also appear in our training data. The desired percentage is lower than 5% and we are above that threshold. This information is very valuable for fine tuning our dataset and model’s evaluation.

    Train Test Evaluation
    Train Test Evaluation

    Upon closer examination specific test samples were found to have corresponding instances within the training set(46 out of 156). To address this you can apply different split techniques suitable for your requirements to reduce the data overlap.

    Label Drift

    We also have a passed status for the “Label Drift” check. The Label Drift check acts as a measure to gauge the dissimilarity between the distributions of labels in the train and test datasets. It calculates the difference in label distributions between these datasets. The “Passed” status signifies that the label drift is within an acceptable range, with a minimal score of 0.15.

    Train Test Evaluation - Label Drift
    Train Test Evaluation – Label Drift

    This check also provides insights into the distributions, showing the top 10 categories with the most significant differences between the two datasets. Similarly, you an conduct Embedding Drift, NLP Property Drift checks. Please refer Deepchecks NLP documentation for more details

    Model Evaluation

    The model evaluation suite, is designed to be run after a model has been trained and requires model predictions and probabilities which can be supplied via the arguments in the run function.

    from deepchecks.nlp.suites import model_evaluation
    model_evaluation().run(train, test,    
    Train Test Performance

    The condition “Train Test Performance” failed in the output. The train test degradation score is a measure of how much a model’s performance drops when moving from the training dataset to the testing dataset. It’s an important metric to evaluate how well a model performs on the unseen data.

    Model Evaluation - Train Test Performance
    Model Evaluation – Train Test Performance

    The output also highlights that there are 2 scores that failed this degradation test. The most significant degradation observed is for the “Recall” metric, specifically for class 1 with a maximum degradation of 12.28%. It signifies a notable drop in the model’s ability to correctly identify instances belonging to class 1 when applied to new or unseen data. This could imply that the model is not maintaining consistent performance levels across training and testing data.

    Prediction Drift

    The condition for “Prediction Drift” passed in the output. It is indicating the outcome of monitoring the model for prediction drift using the Kolmogorov-Smirnov (KS) drift score. The KS drift score reported is 0.14. Even though the KS drift score of 0.14 is slightly lower than the specified threshold 0.15, it is important to analyze the source of this drift. Compare the distribution of predictions at various time points to pinpoint when the drift started occurring. Consider retraining or fine-tuning the model to ensure that the model remains effective and reliable in its predictions over time.

    Model Evaluation – Prediction Drift
    Model Evaluation – Prediction Drift

    Full Suite

    Deepchecks gives a quick overview of your model and data with Full Suite that includes many of the implemented checks. full_suite is a collection of the prebuilt checks.
    To import the full_suite method, use the code below

    from deepchecks.nlp.suites import full_suite
    suite = full_suite(),

    This will perform full checks on the dataset objects and the model and generates a consolidated output with Passed/Didn’t Pass.

    You can find the final code here.

    Using Deepchecks, helps us check and fine tune our models and data. It’s user-friendly and offers different tests to catch issues early, making our models work better. Happy Testing!!


    1. Deepchecks Documentation
    2. Deepchecks NLP: ML Validation for Text Made Easy
    3. High-standard ML validation with Deepchecks

    Hire technical testers from Qxf2

    Qxf2 offers valuable expertise in testing Machine Learning projects. Our team goes beyond traditional QA to ensure data quality, model accuracy, and system robustness. With us, you’ll have skilled testers who understand the nuances of ML and can enhance the reliability of your projects. Get in touch with us!

    Leave a Reply

    Your email address will not be published. Required fields are marked *