How to Solve the Problem of Imbalanced Datasets: Meet Tonic Data Science Mode
When classifying rare events, Tonic Data Science Mode creates synthetic data to augment the minority class and improve the accuracy of your model
Data scientists often struggle with bias in their machine learning models. Building an accurate model to classify a rare event proves challenging when you have a limited amount of data for your minority class.
Typical ways of addressing this issue include randomly downsampling the majority class (throwing away useful data), randomly upsampling the minority class (which tends to produce overfit models), or creating synthetic data that matches the patterns in the real data.
In this post we’ll examine how a new data synthesis platform specifically designed for data scientists, Tonic Data Science Mode (DSM), helps address biases in classification models caused by imbalanced classes. We compare the performance of Logistic Regression, XGBoost, and CatBoost models trained on datasets balanced using DSM, SMOTE, and SMOTE-NC.
We’ll find that balancing our target class with synthetic data generated with DSM leads to improvements in the ROC AUC and F1 scores of models.
Our Dataset
We will use a dataset from Kaggle to predict customer churn — whether or not a customer has left the service in the last month — for a fictional telecom company, Telco. The dataset contains 7,032 rows, each representing a customer, with that customer's characteristics captured in 21 columns. These columns include a mix of numeric and categorical features.
Being able to predict which customers will churn allows the telecom company to deploy targeted customer retention measures to prevent resulting losses in revenue.
This dataset is clearly imbalanced, with churned customers represented only one-third as frequently as non-churned customers. Let's do some initial modeling to see how the models respond to the imbalanced data and to set a baseline.
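To make this concrete, here is a minimal sketch of loading the data and checking the class balance with pandas. It assumes the CSV file name from the Kaggle dataset page and is not the exact notebook code:

```python
import pandas as pd

# Load the Telco churn data (file name assumed from the Kaggle dataset page)
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is read as a string because of a few blank values;
# converting and dropping those rows leaves the 7,032 rows used in this post
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

# Check the balance of the target class:
# "No" is roughly three-quarters of rows, "Yes" roughly a quarter
print(df["Churn"].value_counts(normalize=True))
```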
Comparing Models Trained with Imbalanced Data and Data Balanced Using Data Science Mode, SMOTE, and SMOTE-NC
We evaluate three models (Logistic Regression, XGBoost, and CatBoost) using the following procedure, sketched in code below:
- Train-test split the data 75–25.*
- One-hot encode the categorical variables for Logistic Regression and XGBoost (CatBoost handles categorical features natively).
- Train each model on the training set.
- Evaluate model performance with a variety of metrics, including ROC AUC, F1, and confusion matrices.
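Here is a minimal sketch of that baseline pipeline, reusing the DataFrame loaded above. It assumes the standard Telco column names (customerID, Churn) and default hyperparameters rather than the exact settings of the original notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Target and features; column names follow the Kaggle Telco schema
y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["Churn", "customerID"])

# 75-25 train-test split, stratified so both splits keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One-hot encode the categorical columns for Logistic Regression and XGBoost
X_train_ohe = pd.get_dummies(X_train, dtype=float)
X_test_ohe = pd.get_dummies(X_test, dtype=float).reindex(
    columns=X_train_ohe.columns, fill_value=0.0
)

# CatBoost consumes the categorical columns directly
cat_features = X_train.select_dtypes(include="object").columns.tolist()

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_train_ohe, y_train),
    "XGBoost": XGBClassifier(eval_metric="logloss").fit(X_train_ohe, y_train),
    "CatBoost": CatBoostClassifier(cat_features=cat_features, verbose=0).fit(X_train, y_train),
}

# Evaluate each model on the held-out test set
for name, model in models.items():
    X_eval = X_test if name == "CatBoost" else X_test_ohe
    proba = model.predict_proba(X_eval)[:, 1]
    preds = model.predict(X_eval)
    print(name, roc_auc_score(y_test, proba), f1_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```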
Imbalanced Data
All of these models have similar ROC AUC scores and much higher true negative rates than true positive rates, a sign that training on imbalanced data biases the models toward predicting the negative class. For an imbalanced dataset, however, their ROC AUC scores are relatively strong, all around 83%.
Using Synthetic Data to Balance Customer Churn
Since customers who churn make up only about a quarter of the data, our models have a greater opportunity to learn patterns associated with the majority class, making it more difficult for them to correctly classify the minority class. We address this class imbalance by augmenting our training data with synthetic minority samples.
DSM produces synthetic data that mimics the patterns and distributions of real data, allowing us to balance our training data with highly realistic synthetic samples. We compare DSM against SMOTE and SMOTE-NC.
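DSM itself is driven through Tonic's platform, so the sketch below covers only the SMOTE baselines: one way to produce SMOTE- and SMOTE-NC-augmented training sets with the imbalanced-learn library, reusing the variables from the baseline sketch above.

```python
from imblearn.over_sampling import SMOTE, SMOTENC

# Plain SMOTE needs fully numeric input, so it runs on the one-hot encoded training set
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_train_ohe, y_train)

# SMOTE-NC handles mixed numeric/categorical data;
# pass the positional indices of the categorical columns
cat_idx = [X_train.columns.get_loc(c) for c in cat_features]
X_smnc, y_smnc = SMOTENC(categorical_features=cat_idx, random_state=0).fit_resample(
    X_train, y_train
)

print(y_train.value_counts())  # imbalanced original training data
print(y_smnc.value_counts())   # balanced 50/50 after oversampling the churn class
```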
Note: Both DSM and SMOTE have some inherent randomness in how they generate data. To fairly test these augmentation methods, we generate augmented data with each technique 100 times and look at the distributions of the resulting ROC AUC and F1 scores. This gives a more accurate picture of how the methods perform than fishing for the best random state.
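As a sketch of that methodology, here is one way to repeat the augment-train-score loop 100 times and collect the score distributions, shown for SMOTE-NC with CatBoost (the same loop applies to the other augmentation methods; the exact notebook may differ):

```python
import numpy as np

def score_catboost(X_tr, y_tr, seed):
    # Train CatBoost on a (possibly augmented) training set and score it on the fixed test set
    model = CatBoostClassifier(cat_features=cat_features, random_seed=seed, verbose=0)
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, proba), f1_score(y_test, model.predict(X_test))

# Re-augment and re-train 100 times so we see score distributions, not one lucky run
aucs, f1s = [], []
for seed in range(100):
    X_aug, y_aug = SMOTENC(
        categorical_features=cat_idx, random_state=seed
    ).fit_resample(X_train, y_train)
    auc, f1 = score_catboost(X_aug, y_aug, seed)
    aucs.append(auc)
    f1s.append(f1)

print("ROC AUC median / IQR:", np.median(aucs), np.percentile(aucs, [25, 75]))
print("F1 median / IQR:", np.median(f1s), np.percentile(f1s, [25, 75]))
```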
How Do the Scores Stack Up?
First, let’s take a look at the distributions of the ROC AUC scores for each model by augmentation method.
Wow! The ROC AUC distribution from the CatBoost model trained on DSM-augmented data beat not only the unaugmented model's score but also the score distributions for the SMOTE and SMOTE-NC data. We can confidently say that the synthetic data generated in DSM improves the CatBoost model's accuracy for this imbalanced dataset. The XGBoost model also performed very well with the DSM data, with its median ROC AUC score lying above the interquartile range of the SMOTE model's scores.
For Logistic Regression, none of the augmentation methods improved the ROC AUC score obtained by the model trained on the original imbalanced dataset. Other experiments exploring data augmentation methods to balance minority classes have found similar results for Logistic Regression, suggesting that data augmentation should be used with caution for this type of model.
Now let’s take a look at the F1 scores of the models:
Data augmentation drastically improves the accuracy with which our models classify customer churn. DSM outperforms the other augmentation methods on the CatBoost model, with its entire F1 score distribution lying above those of SMOTE and SMOTE-NC.
DSM's killer performance at improving the CatBoost model is very encouraging. Recalling the confusion matrices from the imbalanced data, the CatBoost model had similar true positive and false negative rates; let's see how DSM augmentation improves these metrics.
Our CatBoost model trained with DSM-augmented data is clearly much better at correctly classifying customer churn than the baseline model, showing a 23.7% improvement in the true positive rate and an 8% increase in F1 score. These model improvements could have a dramatic impact on Telco's bottom line by making customer retention efforts more accurately targeted.
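For reference, the true positive rate quoted above is simply the recall on the churn class, read off the confusion matrix. Here is a quick sketch using the baseline CatBoost model from earlier; the DSM-augmented model would be scored the same way:

```python
# Read the true positive rate (recall on the churn class) off the confusion matrix
preds = models["CatBoost"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print("true positive rate:", tp / (tp + fn))
print("F1 score:", f1_score(y_test, preds))
```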
A Seamless Way to Augment Data
DSM is an AI-powered generative model specifically designed to mimic data in order to solve complex data science problems. Here we've shown that the data generated in DSM improves model performance on an imbalanced dataset more than the SMOTE and SMOTE-NC methods do.
If you would like to recreate this experiment, the complete Jupyter notebook can be found on GitHub and the original blog post can be found on Tonic.ai's website. Also, if you're curious about using DSM, you can head to tonic.ai to sign up for a free trial.
*Note that DSM and the other data augmentation tools are applied only to the training split in this experiment, to avoid data leakage when training the classification models.
**Note on the author: Madelyn is employed by Tonic.ai as a Data Science Evangelist.