Baseline Model Comparison for Performance Evaluation

Machine learning models evolve. As testers, how do we know that the newer version of a model is better? How do we know that it did not get worse in other areas? The most intuitive approach is to design a ‘good’ labelled dataset and then calculate an evaluation metric, such as the F1 score, for the model under test. If the newer version of the model scores higher, we can be somewhat confident that it is better.
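
To make the labelled-dataset approach concrete, here is a minimal sketch of a macro-averaged F1 calculation for the three sentiment classes used later in this post. The function name macro_f1 and the sample labels are ours and purely illustrative; in practice you would likely use a library implementation.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels=("POSITIVE", "NEGATIVE", "NEUTRAL")):
    """Average the per-class F1 scores, weighting every class equally."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # the true class was missed
    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels and predictions from an old and a new model version
y_true = ["POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"]
old_preds = ["POSITIVE", "NEUTRAL", "NEUTRAL", "NEGATIVE"]
new_preds = ["POSITIVE", "NEGATIVE", "NEUTRAL", "NEGATIVE"]
print(macro_f1(y_true, old_preds), macro_f1(y_true, new_preds))
```

If the second number is higher than the first, the newer version scored better on this particular dataset, which is exactly the signal described above.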

In this post, we want to show another approach to use alongside (not as a replacement for!) the labelled dataset approach. Our approach involves testing a model by evaluating its performance relative to a baseline model. It is particularly useful when you are testing a model against a dataset for the first time, or when the model’s capabilities advance and we encounter new datasets that lack benchmark comparisons. By comparing against a baseline model, we can tell whether our test model performed better or worse on the updated or new datasets. This additional data point helps testers assess the quality of the new model and decide whether its results are acceptable.

Why this post?

Understanding the concept of testing an ML model by comparing it with another model is relatively straightforward. The challenge lies in finding an affordable and user-friendly model that can serve as a suitable baseline. This is where platforms like Hugging Face and ChatGPT come into play. We were excited to find that Hugging Face provides APIs and tools that make it easy to download, train, and use a wide range of pre-trained models. As technical testers, we found it very convenient to pick an existing model from this extensive repository. Additionally, we want to demonstrate how ChatGPT can be used as a semantic classifier with minimal coding.


Imagine we are working with a Sentiment Analysis model as the focus of our testing. Our evaluation involves analysing the model’s predictions and assessing its ability to accurately identify the sentiment of input data. Additionally, we need to create a labelled dataset to validate the performance and quality of our models with respect to that specific dataset. Here are the steps involved:

1. Select a specific model that will be the main focus of our testing.
2. Create a carefully labelled dataset to validate and assess the accuracy of our testing process.

1. Model under test – Roberta

For our demo we will consider a Roberta model, cardiffnlp/twitter-roberta-base-sentiment, which is trained on ~58M tweets and fine-tuned for sentiment analysis.

We won’t delve into the specific details of how to use a Transformer model. However, you can find comprehensive instructions and guidelines for using a Transformer model here. To get set up quickly, you need to pip install transformers and torch, which is a machine learning library.

pip install transformers
pip install torch

We initiate the process by specifying the model name. Next, we employ Hugging Face’s AutoModelForSequenceClassification and AutoTokenizer to load the pre-trained model. By utilising the pipeline feature, we can conveniently leverage the model. With the classifier in place, we can predict the sentiment for any given tweet.

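from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained sentiment model and its tokenizer from Hugging Face
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Illustrative input; in the test script each tweet comes from the dataset
tweet = "What a match! Best World Cup final ever."
results = classifier(tweet)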
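
The pipeline returns a list with one dictionary per input, each holding a label and a score. Below is a small sketch of how that result could be normalised into an upper-case sentiment before comparing it with the dataset labels. The helper normalise_prediction and the sample values are ours. The LABEL_MAP entries are an assumption for checkpoints that emit generic LABEL_n names, so please verify the mapping against the model card of the checkpoint you use.

```python
# Assumed mapping for checkpoints that emit generic label names (verify on the model card)
LABEL_MAP = {"LABEL_0": "NEGATIVE", "LABEL_1": "NEUTRAL", "LABEL_2": "POSITIVE"}


def normalise_prediction(results):
    """Return (sentiment, score) from a pipeline result for a single input."""
    top = results[0]
    # Fall back to the raw label when it is not one of the generic names
    label = LABEL_MAP.get(top["label"], top["label"]).upper()
    return label, top["score"]


# Illustrative pipeline output, not a real prediction
sample_results = [{"label": "positive", "score": 0.98}]
print(normalise_prediction(sample_results))
```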

2. Data set

For this exercise, we have selected a dataset from Kaggle containing FIFA World Cup 2022 Tweets. The dataset consists of over 22,500 tweets specifically related to the FIFA 2022 World Cup. It is a labelled dataset, meaning that each tweet has been categorised as either POSITIVE, NEGATIVE, or NEUTRAL sentiment. However, we will not test on the entire dataset. For our experiment, we will use just 100 data points from the dataset to assess the performance of our models.
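
As a rough sketch of how such a subset could be drawn reproducibly, the snippet below samples rows from a CSV with a fixed random seed. The function name sample_tweets, the column names Tweet and Sentiment, and the in-memory demo data are assumptions for illustration; the actual Kaggle file may use different headers.

```python
import csv
import io
import random


def sample_tweets(csv_file, sample_size=100, seed=42):
    """Read all rows from an open CSV file and return a reproducible random sample."""
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)  # fixed seed so every run tests the same subset
    return rng.sample(rows, min(sample_size, len(rows)))


# Tiny in-memory stand-in for the real dataset (hypothetical column names)
demo_csv = io.StringIO(
    "Tweet,Sentiment\n"
    "What a goal!,positive\n"
    "Terrible refereeing,negative\n"
    "Kick-off is at 7pm,neutral\n"
)
subset = sample_tweets(demo_csv, sample_size=2)
print(subset)
```

Fixing the seed matters when comparing model versions: both versions should be scored on the same 100 tweets.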

Upon analysing the sentiment in the dataset, we discovered that we were dissatisfied with the assigned labels for certain records. It became necessary to clean up the dataset and address cases where biases and errors were present. This step of data cleaning is crucial when preparing a dataset for testing purposes. In a previous project, we implemented a process where a group of individuals collaboratively labelled the dataset and made labelling decisions based on the results obtained from a larger group. This approach helped us mitigate the risk of bias and reduce the likelihood of errors during the labelling process.
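
One simple way to turn several people's labels into a single decision is a majority vote. The sketch below is only an illustration of that idea, not the exact process we used in the previous project; the function name majority_label and the min_agreement threshold are hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the winning label, or None when no label exceeds min_agreement."""
    if not votes:
        return None
    counts = Counter(vote.strip().upper() for vote in votes)
    label, count = counts.most_common(1)[0]
    # Require a strict majority so that ties are sent back for review
    return label if count / len(votes) > min_agreement else None


print(majority_label(["positive", "POSITIVE", "neutral"]))  # POSITIVE
print(majority_label(["positive", "negative"]))             # None
```

Tweets that come back as None are good candidates for a second round of review or for removal from the test set.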


Once you obtain the sentiment predictions of the model against a labelled dataset, you can observe how many predictions match the labels in the dataset. However, to determine the quality of these results, it is beneficial to have a baseline model for comparison. This baseline model is particularly useful when encountering new datasets or introducing additional features to the model that need to be tested with different datasets. It allows for an evaluation of the model’s performance in relation to a reference point, enabling a more comprehensive assessment.

To implement this approach:

1. Choose a model that will serve as the baseline for comparison.
2. Develop a script to compare the performance of the two models.

1. Choose a model that will serve as the baseline for comparison

There are various Sentiment Analysis models available within Hugging Face that could serve as the baseline model. In our case, we will use ChatGPT as an example, to demonstrate how the ChatGPT LLM can function as a classifier for sentiments. This technique is valuable for testers to understand, as it is easy to implement. It also produced reasonably good results in tweet classification tasks.

To obtain the sentiment of a tweet using ChatGPT, we utilise the completions API endpoint. You can find detailed information about it here. We process one tweet at a time to stay within token size limits. While this approach may be slower, it keeps each classification tied to a single tweet. We had to adjust the prompt to get the desired sentiment classification. We also added code-level clean-up to remove extra characters that we observed in the response.

import openai

prompt = "Please analyze the sentiment for the following football World Cup tweet and classify it as either POSITIVE, NEGATIVE, or NEUTRAL only. Ensure that the GPT response contains only the sentiment classifier in all caps, without any unnecessary characters or special symbols."
# dynamic_message holds the text of the tweet being classified
input_text = f"{prompt} {dynamic_message}"
# Call the completions endpoint (legacy openai<1.0 interface) with the combined prompt
response = openai.Completion.create(model='text-davinci-003', prompt=input_text)
# Strip surrounding whitespace from the returned sentiment
sentiment = response.choices[0].text.strip()
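
The clean-up of extra characters mentioned above could look something like the following. This is a minimal sketch rather than our exact code; the name clean_sentiment and the rule of rejecting ambiguous answers are assumptions made for illustration.

```python
import re

VALID_SENTIMENTS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def clean_sentiment(raw_text):
    """Extract a single valid sentiment label from a raw completion, or None."""
    # Upper-case the text and keep only alphabetic words, dropping punctuation and newlines
    words = re.findall(r"[A-Z]+", raw_text.upper())
    found = [word for word in words if word in VALID_SENTIMENTS]
    # Accept only unambiguous answers that mention exactly one label
    return found[0] if len(set(found)) == 1 else None


print(clean_sentiment("\n\nPositive."))           # POSITIVE
print(clean_sentiment("POSITIVE or NEUTRAL"))     # None
```

Returning None for ambiguous or empty responses lets the test script flag those tweets instead of silently counting them as mismatches.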

2. Script to compare the performance of the two models

The testing script includes functions to load tweets from a CSV file and an analyze_sentiment method, which generates sentiments using both the Roberta and ChatGPT models. We have also added checks that find mismatches between each model’s sentiments and the true sentiments, as well as mismatches between the sentiments generated by the two models. These checks let us count both kinds of mismatch.

def get_mismatched_model_sentiments(self, model_results, tweets):
        "Get tweets with mismatched sentiments between the model and actual sentiments"
        mismatched_tweets = []
        for tweet_id, model_result in model_results.items():
            # Compare in upper case in case a sentiment was received in lowercase
            actual_sentiment = tweets[tweet_id][2].upper()
            if model_result.upper() != actual_sentiment:
                tweet_text = tweets[tweet_id][1]
                mismatched_tweets.append(
                    (tweet_id, tweet_text, actual_sentiment, model_result))
        return mismatched_tweets

def get_mismatched_tweets_between_models(self, model1_results, model2_results, tweets):
        "Get tweets with mismatched sentiments between two models"
        mismatched_tweets = []
        for tweet_id, model1_result in model1_results.items():
            # Default to an empty string so a missing result counts as a mismatch
            model2_result = model2_results.get(tweet_id, '')
            if model1_result.upper() != model2_result.upper():
                tweet_text = tweets[tweet_id][1]
                mismatched_tweets.append(
                    (tweet_id, tweet_text, model1_result, model2_result))
        return mismatched_tweets

The function “save_sentiments_to_csv” enables us to store the sentiments in a file, facilitating easier comparison and analysis.

import csv

def save_sentiments_to_csv(self, csv_file, tweets, model1_results, model2_results):
        "Save all sentiments to a common CSV file"
        with open(csv_file, 'w', encoding="utf8", newline='') as file:
            writer = csv.writer(file)
            # Header row, followed by one row per tweet
            writer.writerow(['ID', 'Tweet', 'Actual Sentiment',
                            'Model1 Sentiment', 'Model2 Sentiment'])
            for (tweet_id, tweet, actual_sentiment) in tweets:
                # Leave the cell empty when a model returned no result for the tweet
                model1_sentiment = model1_results.get(tweet_id, '')
                model2_sentiment = model2_results.get(tweet_id, '')
                writer.writerow([tweet_id, tweet, actual_sentiment,
                                model1_sentiment, model2_sentiment])

Here is the complete code snippet.


The outcome reveals the count of sentiment mismatches between the model and the actual sentiments. Additionally, it displays the number of sentiment mismatches between the different models. If you wish to perform a comparison on individual tweets, you can refer to the saved CSV file for further analysis.

(transformer_venv) avinash@avinash-Inspiron:~/qxf2/transformer_project/Examples/Sentiment_analysis$ python3 
Sentiments saved to csv file
No of mismatched ROBERTA Sentiments with Actual Sentiment: 3
No of mismatched GPT Sentiments with Actual Sentiment: 20
No of mismatched Sentiments between ROBERTA and GPT: 19
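
To put these counts in perspective, they can be converted into agreement rates. The sketch below assumes that all 100 sampled tweets were scored by both models; the helper name summarise is ours.

```python
def summarise(total, mismatches):
    """Convert mismatch counts into agreement rates between 0 and 1."""
    return {name: (total - count) / total for name, count in mismatches.items()}


total_tweets = 100  # assumption: every tweet in the 100-tweet subset was scored
mismatches = {
    "ROBERTA vs actual": 3,
    "GPT vs actual": 20,
    "ROBERTA vs GPT": 19,
}
for name, rate in summarise(total_tweets, mismatches).items():
    print(f"{name}: {rate:.0%} agreement")
```

Under that assumption, Roberta agrees with the labels on 97% of the tweets, GPT on 80%, and the two models agree with each other on 81%.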


The use case above gives you a feel for the steps involved in a comparative analysis between models. Testing a model by comparing it with a baseline model gives testers an additional reference point for judging its performance, and platforms like Hugging Face and ChatGPT make it easy to obtain a pre-trained baseline. This is useful as machine learning models continue to evolve.

Furthermore, you can also include additional checks to ensure that the model exhibits only slight deviations and avoids producing contradictory results. For instance, it is essential to verify that positive tweets are not incorrectly classified as negative. These additional checks help maintain the model’s accuracy and prevent it from generating opposite or conflicting outcomes.
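
A check for contradictory results could be sketched as follows. It reuses the same data shapes as the mismatch functions above (tweets indexed by tweet ID, with the text at position 1 and the actual sentiment at position 2). The function name get_contradictory_tweets and the sample data are hypothetical.

```python
# Pairs of (actual, predicted) sentiments that count as a contradiction
OPPOSITES = {("POSITIVE", "NEGATIVE"), ("NEGATIVE", "POSITIVE")}


def get_contradictory_tweets(model_results, tweets):
    """Get tweets where the model predicted the opposite of the actual sentiment."""
    contradictions = []
    for tweet_id, model_result in model_results.items():
        actual = tweets[tweet_id][2].upper()
        predicted = model_result.upper()
        if (actual, predicted) in OPPOSITES:
            contradictions.append((tweet_id, tweets[tweet_id][1], actual, predicted))
    return contradictions


# Illustrative data: (id, text, actual sentiment)
tweets = [
    (0, "What a goal!", "positive"),
    (1, "Awful game", "negative"),
    (2, "Kick-off at 7", "neutral"),
]
model_results = {0: "NEGATIVE", 1: "NEUTRAL", 2: "NEUTRAL"}
print(get_contradictory_tweets(model_results, tweets))
```

Here only the first tweet is reported: a NEGATIVE tweet predicted as NEUTRAL is a slight deviation, while a POSITIVE tweet predicted as NEGATIVE is a contradiction.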

