Class imbalance is one of the most discussed problems in data science.
Suppose I offer you a free drink that I claim is both sweet and tangy, but nearly every sip tastes sweet and only once or twice does it taste tangy. When a friend asks about the taste of your drink, you will naturally say it is sweet. This is the class imbalance problem, and it matters most in scenarios like insurance claim fraud, credit card fraud detection, or disease screening, where actual cases of fraud are few compared to the large number of genuine transactions. The imbalance could be 80/20 or worse. If you build a model on such a dataset, there is a high chance it will simply predict every transaction as non-fraud: that naive model is already 80% accurate while catching zero fraud, which puts the whole task at risk. So how do we deal with it? Are there a few techniques that can help?
Below are some of the commonly used techniques to deal with the class imbalance problem:
1. Downsampling the Majority Class
This method involves removing observations from the majority class until its proportion roughly matches the minority class, which curbs its dominance over the learning algorithm. The most popular heuristic for doing so is resampling without replacement. Its disadvantage is information loss, since the discarded observations may carry useful signal.
from sklearn.utils import resample
majority_downsampled = resample(majority, replace=False, n_samples=len(minority))
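As a fuller illustration, here is a minimal, self-contained sketch; the toy DataFrame df and its feature and label columns are hypothetical names invented for this example:

import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 90 majority rows (label 0) vs 10 minority rows (label 1)
df = pd.DataFrame({'feature': range(100), 'label': [0] * 90 + [1] * 10})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Downsample the majority class without replacement to match the minority size
majority_downsampled = resample(majority, replace=False,
                                n_samples=len(minority), random_state=42)

# Recombine into a balanced dataset: 10 vs 10
df_downsampled = pd.concat([majority_downsampled, minority])
print(df_downsampled['label'].value_counts())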
2. Upsampling the Minority Class
This method upsamples the minority class to balance the proportions by randomly duplicating observations from it. The most common way is resampling with replacement.
minority_upsampled = resample(minority, replace=True, n_samples=len(majority))
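Continuing the hypothetical df sketch from technique 1, upsampling mirrors the previous example; note that the duplicated rows can encourage overfitting:

# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

# Recombine into a balanced dataset: 90 vs 90 in the toy data
df_upsampled = pd.concat([majority, minority_upsampled])
print(df_upsampled['label'].value_counts())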
3. SMOTE (Synthetic Minority Oversampling Technique)
It creates synthetic data points by finding each minority sample's nearest neighbors and interpolating new, similar instances between them. In other words, new synthetic examples are generated from the minority class rather than duplicated. You can use the imbalanced-learn package and call its SMOTE implementation:
pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
X, y = SMOTE().fit_resample(X, y)
(We will cover a detailed article on this topic)
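As a self-contained sketch, here is SMOTE applied to a synthetic 90/10 dataset; the make_classification settings below are arbitrary choices for the example, not a recommendation:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary classification data with a 90/10 class split
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print('Before SMOTE:', Counter(y))

# SMOTE synthesizes new minority points between nearest minority neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print('After SMOTE: ', Counter(y_res))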
4. Penalized Algorithms
Penalized learning algorithms increase the cost of misclassifying the minority class. The objective is still to minimize a loss function, but the added penalty nudges the algorithm toward the minority class. A commonly used technique is the penalized SVM:
from sklearn.svm import SVC
model = SVC(kernel='linear', class_weight='balanced')
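As a minimal sketch, reusing the synthetic X and y from the SMOTE example above: class_weight='balanced' scales the misclassification penalty inversely to class frequencies, so mistakes on the rare class cost more.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# 'balanced' sets per-class weights inversely proportional to class frequency
model = SVC(kernel='linear', class_weight='balanced')
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))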
5. Tree-Based Algorithms
Decision trees also tend to perform well on imbalanced datasets, because their hierarchical structure lets them learn signals from both classes. Ensembles built with methods like bagging and boosting, such as random forests and gradient boosting, combine the predictions of many trees and usually do even better.
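As one example (again reusing the synthetic X and y from above), here is a sketch with a random forest, a bagging ensemble of trees; the class_weight option is one common way to combine it with the penalization idea from technique 4:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Bagged ensemble of decision trees; 'balanced' re-weights the rare class
forest = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                                random_state=42)
forest.fit(X_train, y_train)
print(classification_report(y_test, forest.predict(X_test)))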
Hope you find this article useful. We will cover more detailed, code-based articles on each technique in future posts.
If you found this article interesting, why not browse the other articles in our archive?