Data quality matters when building and refining a Classification Model

In the world of machine learning, data takes centre stage. It’s often said that data is the key to success. In this blog post, we emphasise the significance of data, especially when building a comment classification model. We will delve into how data quality, quantity, and biases significantly influence machine learning model performance. Additionally, we’ll explore techniques like undersampling as a strategy for addressing class imbalance.


Why This Post

As a tester, gaining insights into the world of machine learning model building is invaluable. Understanding the intricacies of how ML models are constructed gives testers a unique advantage. It enables you to craft more effective tests, come up with better test datasets, anticipate model behaviour, and identify potential pitfalls. At Qxf2 Services we always underscore the importance of understanding the application in order to test it well. By diving into the inner workings of model development, you’ll be better equipped to design comprehensive tests and enhance the model’s practicality.


Problem: Building a comment classification model for WordPress comments.

We set out to address a particular challenge: using machine learning on our WordPress comment dataset to build a predictive model capable of distinguishing between spam and non-spam comments. While our WordPress plugin, Titan Anti Spam, is effective at filtering out most spam comments, some still manage to slip through the cracks. We didn’t create this model to replace it. Instead, our goal was to learn about building models and gain insights into the significance of data. We developed a straightforward classification model for comment categorisation, opting for a Naive Bayes classifier.

Please note that this blog primarily delves into our data manipulation journey rather than extensive model-building details.


Solution: Building and Refining the Model.

Preparing and Refining the Model Data

Our objective was to develop a model that could effectively distinguish between spam and non-spam comments. To achieve this, we used a dataset of comments from our WordPress website. The dataset contained various fields, but our focus was primarily on extracting the comments and their corresponding labels (0 or 1). These labels were obtained through a combination of manual human labelling of comments as spam or non-spam and the output of a WordPress plugin, accumulated gradually over time. Additionally, some preprocessing steps were necessary to remove special characters and quotation marks. We also ensured that the labels were simplified to ‘0’ for spam messages and ‘1’ for non-spam messages. This data was saved to a text file so that we could use it for building our classification model. You can find the refined data set here.
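
The exact clean-up script is not reproduced in this post, but a rough sketch of this kind of preparation might look like the following. Note that the column names ('comment_content' and 'label'), the regex-based cleaning rule, and the file paths are assumptions for illustration, not the original code.

    import re
    import pandas as pd

    def prepare_comments(input_csv, output_txt):
        """Illustrative clean-up: keep the comment text and its label, strip noisy characters."""
        raw = pd.read_csv(input_csv)  # assumed columns: 'comment_content', 'label'
        # Remove quotation marks and other special characters from the comment text
        raw['comment_content'] = raw['comment_content'].astype(str).apply(
            lambda text: re.sub(r'[^A-Za-z0-9\s.,!?]', ' ', text))
        # Keep labels as 0 (spam) and 1 (non-spam)
        raw['label'] = raw['label'].astype(int)
        # Save the refined comment/label pairs to a text file for the model-building step
        raw[['comment_content', 'label']].to_csv(output_txt, index=False)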


Initial Model and Challenges

The main steps involved here were:

  • Data Extraction: Most of the data extraction was already done as part of the previous step. Here, we read the data from the text file using the csv reader and transformed it into a pandas DataFrame.
    def load_data(input_data):
        """Load data from a text file and return a DataFrame."""
        txt_data = []
        with open(input_data, "r", encoding="utf-8") as txt_file:
            reader = csv.reader(txt_file, delimiter=',')
            # Skip the header row
            next(reader)
            for row in reader:
                if len(row) >= 2:
                    comment_text = row[0]
                    label = row[1]
                    txt_data.append((comment_text, label))
        return pd.DataFrame(txt_data, columns=['sentence', 'label'])
  • Data Preprocessing: Data preprocessing was a crucial step in making the text easier for the model to work with. We performed text cleaning and applied the SnowballStemmer, which reduces words to their stems so that different forms of the same word are treated alike.
    def preprocess_data(input_data):
        """Preprocess the data, including text cleaning and stemming."""
        # Initialize the SnowballStemmer
        stemmer = SnowballStemmer("english")
        # Stem each word of every sentence using the SnowballStemmer
        input_data['sentence'] = input_data['sentence'].apply(lambda x: " ".join([
            stemmer.stem(i) for i in x.split()]))
        return input_data
  • Training: In this step, we built a text classification model using a machine learning pipeline with feature extraction and classification components. We employed TF-IDF vectorization to transform the text into numerical features, used the chi-squared test for feature selection, and chose the Multinomial Naive Bayes classifier as our machine learning algorithm. The model was then trained on the provided training data.
    def train_model(x_train, y_train):
        """Train a text classification model."""
        # Define the pipeline with feature extraction and classification steps
        pipeline = Pipeline([
            ('vect', TfidfVectorizer(ngram_range=(1, 4), sublinear_tf=True)),
            ('chi', SelectKBest(chi2, k=1000)),
            ('clf', MultinomialNB())  # Using Naive Bayes classifier
        ])
     
        # Fit the model
        model = pipeline.fit(x_train, y_train)
        return model
  • Model Evaluation: Once the model was trained, we added an evaluation step to check its performance. Here, we employed metrics such as accuracy, precision, recall, and F1-score, along with a confusion matrix. This evaluation is crucial for gauging the model’s effectiveness on the testing dataset.
    def evaluate_model(model, x_test, y_test):
        """Evaluate the model on the test data."""
        # Predict the labels for the test set
        y_pred = model.predict(x_test)
     
        # Calculate performance metrics
        acc_score = accuracy_score(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        recall = recall_score(y_test, y_pred, pos_label='1')
        precision = precision_score(y_test, y_pred, pos_label='1')
        f1score = f1_score(y_test, y_pred, pos_label='1')
     
        return acc_score, conf_matrix, recall, precision, f1score

You can find the complete code here.
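
To tie these pieces together, here is a minimal sketch of how the functions above might be wired up. The file name and the 80/20 train/test split are illustrative assumptions, not part of the original code.

    import csv
    import pandas as pd
    from nltk.stem import SnowballStemmer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import (accuracy_score, confusion_matrix, recall_score,
                                 precision_score, f1_score)

    # Load and preprocess the labelled comments, then split into train and test sets
    data = load_data("comments.txt")  # placeholder path to the refined data set
    data = preprocess_data(data)
    x_train, x_test, y_train, y_test = train_test_split(
        data['sentence'], data['label'], test_size=0.2, random_state=42)

    # Train the pipeline and evaluate it on the held-out test data
    model = train_model(x_train, y_train)
    acc_score, conf_matrix, recall, precision, f1score = evaluate_model(model, x_test, y_test)
    print(f"Accuracy: {acc_score}")
    print(conf_matrix)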

After running the tests against this model, the results looked as shown below.

[Image: model evaluation result]

The model had a high accuracy of around 96.9% and did a good job of identifying spam messages. However, it struggled with non-spam messages, getting them right only 15.81% of the time. This difference in performance suggests that the model might be leaning too heavily towards classifying messages as spam. This bias could be due to the training data, which had approximately 36,285 spam messages but only about 1,429 non-spam messages.
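
A quick way to spot this kind of imbalance before training is to look at the label counts in the DataFrame, for example:

    # Inspect how many comments fall into each class ('0' = spam, '1' = non-spam)
    print(data['label'].value_counts())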


Undersampling Strategies

Since our data had far more spam messages than non-spam ones, we explored ways to fix this imbalance. We found a technique called RandomUnderSampler, which under-samples the majority class by randomly picking samples from it, allowing us to balance the two classes. By setting sampling_strategy to 1.0, we ensure an equal number of spam and non-spam messages (a short usage sketch follows the function below).

    def undersample_data(x_data, y_data, target_percentage=1.0):
        """Undersample the majority class to achieve the target percentage."""
        sampler = RandomUnderSampler(sampling_strategy=target_percentage, random_state=42)
        # Convert the Pandas Series to a NumPy array and reshape them
        x_data = x_data.to_numpy().reshape(-1, 1)
        y_data = y_data.to_numpy().ravel()  # Use ravel to convert to 1D array
     
        # Undersample the data
        x_resampled, y_resampled = sampler.fit_resample(x_data, y_data)
     
        # Convert the NumPy arrays back to Pandas Series
        x_resampled_series = pd.Series(x_resampled[:, 0])
        y_resampled_series = pd.Series(y_resampled)
     
        return x_resampled_series, y_resampled_series
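
As mentioned above, here is a small sketch of how this could slot into the earlier flow: the under-sampling is applied to the training split only, and the model is retrained on the balanced data. Variable names follow the earlier snippets; RandomUnderSampler comes from the imbalanced-learn package.

    from imblearn.under_sampling import RandomUnderSampler

    # Balance only the training split, then retrain and re-evaluate the model
    x_train_bal, y_train_bal = undersample_data(x_train, y_train, target_percentage=1.0)
    model = train_model(x_train_bal, y_train_bal)
    acc_score, conf_matrix, recall, precision, f1score = evaluate_model(model, x_test, y_test)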

Below are the results with target_percentage=1.0.

    
[Image: result with random undersampling 1.0]

When we applied random undersampling with a ratio of 1.0, the model improved its ability to identify non-spam messages, although the overall accuracy decreased. We therefore experimented with various sampling strategies. Using a random undersampling ratio of 0.3, we observed significant improvements in accuracy and the F1 score.

[Image: result with random undersampling 0.3]

We can now put our model into action to classify WordPress comments effectively. For future use, we’ve also saved the model to a pickle file.
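
A minimal sketch of saving and reloading the trained pipeline with pickle might look like this. The file name and sample comment are placeholders, and new comments need the same stemming preprocessing before prediction.

    import pickle
    from nltk.stem import SnowballStemmer

    # Persist the trained pipeline for later use
    with open("comment_classifier.pkl", "wb") as model_file:
        pickle.dump(model, model_file)

    # Reload it when classifying new comments
    with open("comment_classifier.pkl", "rb") as model_file:
        loaded_model = pickle.load(model_file)

    # Apply the same stemming used during training before predicting
    stemmer = SnowballStemmer("english")
    new_comment = "Great post, thanks for sharing!"
    stemmed_comment = " ".join(stemmer.stem(word) for word in new_comment.split())
    print(loaded_model.predict([stemmed_comment]))  # '1' = non-spam, '0' = spam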


Conclusion and Takeaways

In short, we’ve discussed how we built a model to classify WordPress comments. We covered the steps from identifying the problem to finding a solution. Our solution involved preparing the data, creating an initial model, and dealing with the class imbalance problem using an undersampling technique.

But our journey isn’t over yet! In our next blog post, we’ll explore how to test these models and make them even better. As QA professionals, understanding the inner workings of these models is crucial: it helps us test them effectively and ensure their quality. Stay tuned for more insights into the world of machine learning and quality assurance!


Hire QA from Qxf2 Services

Qxf2 is a group of skilled technical testers proficient in both conventional testing approaches and addressing the distinct challenges of testing contemporary software systems. Our expertise lies in testing microservices, data pipelines, and AI/ML-driven applications. Qxf2 engineers are adept at working independently and thrive in small engineering teams. Feel free to contact us here.

