{"id":20262,"date":"2023-11-20T06:21:52","date_gmt":"2023-11-20T11:21:52","guid":{"rendered":"https:\/\/qxf2.com\/blog\/?p=20262"},"modified":"2023-11-20T06:21:52","modified_gmt":"2023-11-20T11:21:52","slug":"data-quality-matters-when-building-and-refining-a-classification-model","status":"publish","type":"post","link":"https:\/\/qxf2.com\/blog\/data-quality-matters-when-building-and-refining-a-classification-model\/","title":{"rendered":"Data quality matters when building and refining a Classification Model"},"content":{"rendered":"<p>In the world of machine learning, data takes centre stage. It&#8217;s often said that data is the key to success. In this blog post, we emphasise the significance of data, especially when building a comment classification model. We will delve into how data quality, quantity, and biases significantly influence machine learning model performance. Additionally, we&#8217;ll explore techniques like undersampling as a strategy for addressing class imbalance.<\/p>\n<hr \/>\n<h4>Why This Post<\/h4>\n<p> As a tester, gaining insights into the world of machine learning model building is invaluable. Understanding the intricacies of how ML models are constructed can provide testers with a unique advantage. It enables you to craft more effective tests, come up with better test datasets, anticipate model behaviour, and identify potential pitfalls. At <a href=\"https:\/\/qxf2.com\/?utm_source=buildingclassificationmodels&#038;utm_medium=click&#038;utm_campaign=From%20blog\">Qxf2 Services<\/a>, we always underscore the importance of understanding the application to test well. 
By diving into the inner workings of model development, you&#8217;ll be better equipped to design comprehensive tests and enhance the model&#8217;s practicality.<\/p>\n<hr \/>\n<h4>Problem: Building a comment classification model for WordPress comments.<\/h4>\n<p>We set out to address a particular challenge: using machine learning on our WordPress comment dataset to build a predictive model capable of distinguishing between spam and non-spam comments. While our WordPress plugin, <a href=\"https:\/\/wordpress.org\/plugins\/anti-spam\/\" rel=\"noopener\" target=\"_blank\">Titan Anti Spam<\/a>, is effective in filtering out most spam comments, some still manage to slip through the cracks. We didn&#8217;t create this model to replace it. Instead, our goal was to learn about building models and gain insights into the significance of data. We developed a straightforward classification model for comment categorisation, opting for the Naive Bayes classifier. <\/p>\n<p>Please note that this blog primarily delves into our data manipulation journey rather than extensive model-building details.<\/p>\n<hr \/>\n<h4>Solution: Building and Refining the Model.<\/h4>\n<h5> Preparing and Refining the Model Data<\/h5>\n<p>Our objective was to develop a model that could effectively distinguish between spam and non-spam comments. To achieve this, we utilized a dataset of comments from our WordPress website. The dataset contained various fields, but our focus was primarily on extracting the comments and their corresponding labels (0 or 1). These labels were obtained through a combination of manual human labelling of comments as spam or non-spam and the use of a WordPress plugin, accumulated gradually over time. Additionally, some preprocessing steps were necessary to remove special characters and quotation marks. We also ensured that the labels were simplified to have &#8216;0&#8217; for spam messages and &#8216;1&#8217; for non-spam messages. 
This data was saved to a text file so that we could use it for building our classification model. You can find the refined dataset <a href=\"https:\/\/gist.github.com\/avinash010\/682f288c99c6c7781a0576eea6cf5bd6#file-wp_comments-txt\" rel=\"noopener\" target=\"_blank\">here<\/a>.<\/p>\n<hr \/>\n<h5>Initial Model and Challenges<\/h5>\n<p>The main steps involved here were:<\/p>\n<li> Data Extraction: We had already done most of the data extraction as part of the previous step. Here, we read the data from the text file using the csv reader and transformed it into a pandas DataFrame.<\/li>\n<pre lang=\"python\">\r\nimport csv\r\nimport pandas as pd\r\n\r\ndef load_data(input_data):\r\n    \"\"\"Load data from a text file and return a DataFrame.\"\"\"\r\n    txt_data = []\r\n    with open(input_data, \"r\", encoding=\"utf-8\") as txt_file:\r\n        reader = csv.reader(txt_file, delimiter=',')\r\n        # Skip the header row\r\n        next(reader)\r\n        for row in reader:\r\n            if len(row) >= 2:\r\n                comment_text = row[0]\r\n                label = row[1]\r\n                txt_data.append((comment_text, label))\r\n    return pd.DataFrame(txt_data, columns=['sentence', 'label'])\r\n\r\n<\/pre>\n<li>Data Preprocessing: Data preprocessing was a crucial step in improving the comprehensibility of the text. We cleaned the text and applied the <a href=\"https:\/\/www.nltk.org\/_modules\/nltk\/stem\/snowball.html\" rel=\"noopener\" target=\"_blank\">SnowballStemmer<\/a>. 
Stemming reduces each word to its root form, which helps the model treat different variants of the same word consistently.<\/li>\n<p><\/p>\n<pre lang=\"python\">\r\nfrom nltk.stem import SnowballStemmer\r\n\r\ndef preprocess_data(input_data):\r\n    \"\"\"Preprocess the data, including text cleaning and stemming.\"\"\"\r\n    # Initialize the SnowballStemmer\r\n    stemmer = SnowballStemmer(\"english\")\r\n    # Stem every word in each comment\r\n    input_data['sentence'] = input_data['sentence'].apply(lambda x: \" \".join([\r\n        stemmer.stem(i) for i in x.split()]))\r\n    return input_data\r\n<\/pre>\n<li>Training: In this step, we built a text classification model using a machine learning pipeline that combines feature extraction and classification components. We employed <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_extraction.text.TfidfVectorizer.html\" rel=\"noopener\" target=\"_blank\">TF-IDF vectorization<\/a> to transform text data into numerical features, used the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.feature_selection.chi2.html\" rel=\"noopener\" target=\"_blank\">chi-squared<\/a> test for feature selection, and chose the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.naive_bayes.MultinomialNB.html\" rel=\"noopener\" target=\"_blank\">Multinomial Naive Bayes classifier<\/a> as our machine learning algorithm. 
The model was then trained on the training data.<\/li>\n<p><\/p>\n<pre lang=\"python\">\r\nfrom sklearn.feature_extraction.text import TfidfVectorizer\r\nfrom sklearn.feature_selection import SelectKBest, chi2\r\nfrom sklearn.naive_bayes import MultinomialNB\r\nfrom sklearn.pipeline import Pipeline\r\n\r\ndef train_model(x_train, y_train):\r\n    \"\"\"Train a text classification model.\"\"\"\r\n    # Define the pipeline with feature extraction and classification steps\r\n    pipeline = Pipeline([\r\n        ('vect', TfidfVectorizer(ngram_range=(1, 4), sublinear_tf=True)),\r\n        ('chi', SelectKBest(chi2, k=1000)),\r\n        ('clf', MultinomialNB())  # Using Naive Bayes classifier\r\n    ])\r\n\r\n    # Fit the model\r\n    model = pipeline.fit(x_train, y_train)\r\n    return model\r\n<\/pre>\n<li>Model Evaluation: Once the model was trained, we evaluated its performance on the testing dataset. Here, we employed metrics such as accuracy, <a href=\"https:\/\/towardsdatascience.com\/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec\" rel=\"noopener\" target=\"_blank\">precision, recall, and F1-score<\/a>, in addition to a confusion matrix. This evaluation step is crucial to gauge the model&#8217;s effectiveness on unseen data.<\/li>\n<p><\/p>\n<pre lang=\"python\">\r\nfrom sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,\r\n                             precision_score, recall_score)\r\n\r\ndef evaluate_model(model, x_test, y_test):\r\n    \"\"\"Evaluate the model on the test data.\"\"\"\r\n    # Predict the labels for the test set\r\n    y_pred = model.predict(x_test)\r\n\r\n    # Calculate performance metrics\r\n    acc_score = accuracy_score(y_test, y_pred)\r\n    conf_matrix = confusion_matrix(y_test, y_pred)\r\n    recall = recall_score(y_test, y_pred, pos_label='1')\r\n    precision = precision_score(y_test, y_pred, pos_label='1')\r\n    f1score = f1_score(y_test, y_pred, pos_label='1')\r\n\r\n    return acc_score, conf_matrix, recall, precision, f1score\r\n<\/pre>\n<p>You can find the complete code <a href=\"https:\/\/gist.github.com\/avinash010\/682f288c99c6c7781a0576eea6cf5bd6\" rel=\"noopener\" target=\"_blank\">here<\/a>.<\/p>\n<p>After running the tests against this model, the results looked as shown below<br 
\/>\n<figure id=\"attachment_20300\" aria-describedby=\"caption-attachment-20300\" style=\"width: 1177px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1.png\" data-rel=\"lightbox-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1.png\" alt=\"comment_classification_result1\" width=\"1177\" height=\"277\" class=\"size-full wp-image-20300\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1.png 1177w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1-300x71.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1-1024x241.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/09\/comment_classification_result1-768x181.png 768w\" sizes=\"auto, (max-width: 1177px) 100vw, 1177px\" \/><\/a><figcaption id=\"caption-attachment-20300\" class=\"wp-caption-text\">model evaluation result<\/figcaption><\/figure><\/p>\n<p>The model had high accuracy at around 96.9% and did a good job of identifying spam messages. However, it struggled with non-spam messages, getting them right only 15.81% of the time. This gap suggests that the model leans too heavily towards classifying messages as spam. The bias could stem from the training data, which had approximately 36,285 spam messages but only about 1,429 non-spam messages.<\/p>\n<hr>\n<h5>Undersampling Strategies<\/h5>\n<p>Since our data had far more spam messages than non-spam ones, we explored ways to fix this imbalance. 
We found a technique called <a href=\"https:\/\/imbalanced-learn.org\/stable\/references\/generated\/imblearn.under_sampling.RandomUnderSampler.html\" rel=\"noopener\" target=\"_blank\">RandomUnderSampler<\/a>, which under-samples the majority class by randomly discarding samples. Setting sampling_strategy to 1.0 gives us an equal number of spam and non-spam messages.<\/p>\n<pre lang=\"python\">\r\nimport pandas as pd\r\nfrom imblearn.under_sampling import RandomUnderSampler\r\n\r\ndef undersample_data(x_data, y_data, target_percentage=1.0):\r\n    \"\"\"Undersample the majority class to achieve the target percentage.\"\"\"\r\n    sampler = RandomUnderSampler(sampling_strategy=target_percentage, random_state=42)\r\n    # Convert the Pandas Series to a NumPy array and reshape them\r\n    x_data = x_data.to_numpy().reshape(-1, 1)\r\n    y_data = y_data.to_numpy().ravel()  # Use ravel to convert to 1D array\r\n\r\n    # Undersample the data\r\n    x_resampled, y_resampled = sampler.fit_resample(x_data, y_data)\r\n\r\n    # Convert the NumPy arrays back to Pandas Series\r\n    x_resampled_series = pd.Series(x_resampled[:, 0])\r\n    y_resampled_series = pd.Series(y_resampled)\r\n\r\n    return x_resampled_series, y_resampled_series\r\n<\/pre>\n<p>Below are the results with target_percentage=1.0<br \/>\n<figure id=\"attachment_20320\" aria-describedby=\"caption-attachment-20320\" style=\"width: 1193px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0.png\" data-rel=\"lightbox-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0.png\" alt=\"Random_Undersampling_1.0\" width=\"1193\" height=\"272\" class=\"size-full wp-image-20320\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0.png 1193w, 
https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0-300x68.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0-1024x233.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_1.0-768x175.png 768w\" sizes=\"auto, (max-width: 1193px) 100vw, 1193px\" \/><\/a><figcaption id=\"caption-attachment-20320\" class=\"wp-caption-text\"><br \/>Result with random undersampling 1.0<\/figcaption><\/figure><br \/>\nWhen we applied random undersampling with a ratio of 1.0, the model improved its ability to identify non-spam messages, although the overall accuracy decreased. Therefore, we experimented with various sampling strategies. Using a random undersampling ratio of 0.3, we observed significant improvements in accuracy and the F1 score.<\/p>\n<figure id=\"attachment_20321\" aria-describedby=\"caption-attachment-20321\" style=\"width: 1187px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3.png\" data-rel=\"lightbox-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3.png\" alt=\"Random Undersampling 0.3\" width=\"1187\" height=\"271\" class=\"size-full wp-image-20321\" srcset=\"https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3.png 1187w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3-300x68.png 300w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3-1024x234.png 1024w, https:\/\/qxf2.com\/blog\/wp-content\/uploads\/2023\/10\/Random_Undersampling_0.3-768x175.png 768w\" sizes=\"auto, (max-width: 1187px) 100vw, 1187px\" \/><\/a><figcaption id=\"caption-attachment-20321\" class=\"wp-caption-text\">Result with random undersampling 0.3<\/figcaption><\/figure>\n<p>We can now 
put our model into action to effectively classify WordPress comments. For future use, we&#8217;ve also saved the model to a pickle file.<\/p>\n<hr>\n<h4>Conclusion and Takeaways<\/h4>\n<p>In short, we&#8217;ve discussed how we built a model to classify WordPress comments. We covered the steps from identifying the problem to finding a solution. Our solution involved preparing data, creating an initial model, and dealing with the data imbalance problem using an undersampling technique.<\/p>\n<p>But our journey isn&#8217;t over yet! In our next blog post, we&#8217;ll explore how to test these models and make them even better. As QA professionals, understanding the insights behind these models is crucial for us. It helps us test the models effectively and ensure their quality. Stay tuned for more insights into the world of machine learning and quality assurance!<\/p>\n<hr>\n<h4>Hire QA from Qxf2 Services<\/h4>\n<p>Qxf2 is a group of skilled technical testers proficient in both conventional testing approaches and addressing the distinct challenges of testing contemporary software systems. Our expertise lies in testing microservices, data pipelines, and AI\/ML-driven applications. Qxf2 engineers are adept at working independently and thrive in small engineering teams. Feel free to contact us <a href=\"https:\/\/qxf2.com\/contact?utm_source=buildingclassificationmodels&#038;utm_medium=click&#038;utm_campaign=From%20blog\">here<\/a>.<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>In the world of machine learning, data takes centre stage. It&#8217;s often said that data is the key to success. In this blog post, we emphasise the significance of data, especially when building a comment classification model. We will delve into how data quality, quantity, and biases significantly influence machine learning model performance. 
Additionally, we&#8217;ll explore techniques like undersampling as [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[355,130,390],"tags":[],"class_list":["post-20262","post","type-post","status-publish","format-standard","hentry","category-ai-testing","category-machine-learning","category-undersampling"],"_links":{"self":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/20262","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/comments?post=20262"}],"version-history":[{"count":53,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/20262\/revisions"}],"predecessor-version":[{"id":20454,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/posts\/20262\/revisions\/20454"}],"wp:attachment":[{"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/media?parent=20262"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/categories?post=20262"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/qxf2.com\/blog\/wp-json\/wp\/v2\/tags?post=20262"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}