ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python

Imbalanced data distribution generally happens when observations in one of the classes are much higher or lower than in the other classes. Because machine learning algorithms try to increase accuracy by reducing error, they do not take the class distribution into account. This problem is common in areas such as Fraud Detection, Anomaly Detection and Facial recognition.
Standard ML techniques such as Decision Trees and Logistic Regression have a bias towards the majority class, and they tend to ignore the minority class, predicting mostly the majority label. As a result, a model trained on imbalanced data is prone to very poor recall on the minority class.
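As a quick illustration of why plain accuracy is misleading here, consider a hypothetical 99:1 class split and a "model" that always predicts the majority class (a minimal sketch with made-up numbers, not taken from the dataset used later):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# made-up labels: 990 genuine (0) and 10 fraud (1) samples
y_true = np.array([0] * 990 + [1] * 10)

# a trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))      # 0.99 -- looks excellent
print("Fraud recall:", recall_score(y_true, y_pred))    # 0.0  -- every fraud is missed

This is exactly the failure mode that the techniques below are meant to address.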
Imbalanced Data Handling Techniques:
  1. SMOTE
  2. Near Miss Algorithm

SMOTE (Synthetic Minority Oversampling Technique) – Oversampling

SMOTE is one of the most commonly used oversampling methods to solve the imbalance problem. Rather than simply duplicating minority examples, it generates virtual training records by linear interpolation between a minority-class sample and one of its k-nearest minority-class neighbours.
A closer look at how the SMOTE algorithm works:
  • Step 1: Set the minority class set A. For each $x \in A$, the k-nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in A.
  • Step 2: The sampling rate N is set according to the imbalance proportion. For each $x \in A$, N examples ($x_1, x_2, \ldots, x_N$) are randomly selected from its k-nearest neighbors, and they construct the set $A_1$.
  • Step 3: For each example $x_k \in A_1$ (k = 1, 2, ..., N), the following formula is used to generate a new example: $x' = x + rand(0, 1) \times |x - x_k|$, in which $rand(0, 1)$ represents a random number between 0 and 1.
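A minimal NumPy sketch of the interpolation in Step 3, written in the usual SMOTE form x' = x + rand(0, 1) * (x_k - x), with made-up vectors for x and its neighbour x_k:

import numpy as np

# x: a minority-class sample; x_k: one of its k-nearest minority-class neighbours
# (both vectors are made up purely for illustration)
x = np.array([1.0, 2.0])
x_k = np.array([1.5, 2.6])

# the synthetic sample lies somewhere on the line segment between x and x_k
x_new = x + np.random.rand() * (x_k - x)
print(x_new)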


NearMiss Algorithm – Undersampling

NearMiss is an undersampling technique: it balances the class distribution by eliminating examples of the majority class. When instances of the two classes lie very close together, removing the nearby majority-class instances increases the separation between the classes and helps classification. To limit the information loss that most undersampling techniques suffer from, near-neighbor methods are widely used.
The basic intuition behind near-neighbor methods is as follows:
  • Step 1: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is to be under-sampled.
  • Step 2: Then, n instances of the majority class that have the smallest distances to those in the minority class are selected.
  • Step 3: If there are k instances in the minority class, the method will keep k*n instances of the majority class.
  • For finding the n closest instances in the majority class, there are several variations of the NearMiss algorithm (a usage sketch follows this list):
    1. NearMiss – Version 1: selects samples of the majority class for which the average distance to the k closest instances of the minority class is smallest.
    2. NearMiss – Version 2: selects samples of the majority class for which the average distance to the k farthest instances of the minority class is smallest.
    3. NearMiss – Version 3: works in two steps. First, for each minority-class instance, its M nearest neighbors are stored. Then, the majority-class instances whose average distance to their N nearest neighbors is the largest are selected.
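    With the imbalanced-learn library (used later in this article), the variant is chosen through the version parameter of NearMiss. A minimal sketch on made-up synthetic data:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    # made-up imbalanced data: roughly 90% majority class, 10% minority class
    X_toy, y_toy = make_classification(n_samples = 1000, weights = [0.9, 0.1], random_state = 0)
    print("Before undersampling:", Counter(y_toy))

    # pick the NearMiss variant through `version` (1, 2 or 3)
    nm = NearMiss(version = 1)
    X_res, y_res = nm.fit_resample(X_toy, y_toy)
    print("After undersampling: ", Counter(y_res))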
    The rest of this article is a hands-on walkthrough to help you choose between these imbalanced data handling techniques.

    Load libraries and data file

    The dataset consists of credit card transactions and contains only 492 fraud transactions out of 284,807 transactions, i.e. the positive (fraud) class accounts for about 0.172% of all records, which makes it highly imbalanced.
    # import necessary modules 
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import confusion_matrix, classification_report
      
    # load the data set
    data = pd.read_csv('creditcard.csv')
      
    # print info about columns in the dataframe
    print(data.info())

    Output:

    RangeIndex: 284807 entries, 0 to 284806
    Data columns (total 31 columns):
    Time           284807 non-null  float64
    V1 ... V28     284807 non-null  float64   (28 columns, each with the same count and dtype)
    Amount         284807 non-null  float64
    Class          284807 non-null  int64

    # normalise the amount column
    data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))
      
    # drop Time and Amount columns as they are not relevant for prediction purpose 
    data = data.drop(['Time', 'Amount'], axis = 1)
      
    # as you can see there are 492 fraud transactions.
    data['Class'].value_counts()

    Output:

    0    284315
    1       492
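    Since matplotlib was imported above, a quick bar chart makes the imbalance visually obvious (a minimal sketch):

    # visualise the class imbalance (0 = genuine, 1 = fraud)
    data['Class'].value_counts().plot(kind = 'bar', title = 'Class distribution')
    plt.show()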

    Split the data into test and train sets

    from sklearn.model_selection import train_test_split
      
    # define the feature matrix X (all columns except Class) and the target y (the Class column)
    X = data.loc[:, data.columns != 'Class'].values
    y = data.loc[:, data.columns == 'Class'].values
      
    # split into a 70:30 ratio
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
      
    # describes info about train and test set
    print("Number transactions X_train dataset: ", X_train.shape)
    print("Number transactions y_train dataset: ", y_train.shape)
    print("Number transactions X_test dataset: ", X_test.shape)
    print("Number transactions y_test dataset: ", y_test.shape)

    Output:

    Number transactions X_train dataset:  (199364, 29)
    Number transactions y_train dataset:  (199364, 1)
    Number transactions X_test dataset:  (85443, 29)
    Number transactions y_test dataset:  (85443, 1)

    Now train the model without handling the imbalanced class distribution

    # logistic regression object
    lr = LogisticRegression()
      
    # train the model on train set
    lr.fit(X_train, y_train.ravel())
      
    predictions = lr.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00     85296
               1       0.88      0.62      0.73       147

        accuracy                           1.00     85443
       macro avg       0.94      0.81      0.86     85443
    weighted avg       1.00      1.00      1.00     85443
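    Before interpreting the report, it helps to look at the raw error counts. confusion_matrix was imported above but not used so far; a minimal sketch of inspecting it for this baseline model:

    # confusion matrix of the baseline model: rows = actual class, columns = predicted class
    cm = confusion_matrix(y_test, predictions)
    print(cm)

    # recall of the fraud class = TP / (TP + FN)
    tn, fp, fn, tp = cm.ravel()
    print("Fraud recall: {:.2f}".format(tp / (tp + fn)))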

    The accuracy comes out to be almost 100%, but did you notice something strange? The recall of the minority class is very low, which shows that the model is heavily biased towards the majority class, so this is not a good model. Next, we apply the imbalanced data handling techniques and compare their accuracy and recall results.

    Using SMOTE Algorithm

    print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
    print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
      
    # import SMOTE module from imblearn library
    # pip install imblearn (if you don't have imblearn in your system)
    from imblearn.over_sampling import SMOTE
    sm = SMOTE(random_state = 2)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
      
    print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
    print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
      
    print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
    print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

    Output:

    Before OverSampling, counts of label '1': [345]
    Before OverSampling, counts of label '0': [199019] 

    After OverSampling, the shape of train_X: (398038, 29)
    After OverSampling, the shape of train_y: (398038,) 

    After OverSampling, counts of label '1': 199019
    After OverSampling, counts of label '0': 199019

    Look! The SMOTE algorithm has oversampled the minority instances so that both classes now have an equal number of records: the minority class has been increased to the size of the majority class.

    Prediction and Recall

    lr1 = LogisticRegression()
    lr1.fit(X_train_res, y_train_res.ravel())
    predictions = lr1.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      0.98      0.99     85296
               1       0.06      0.92      0.11       147

        accuracy                           0.98     85443
       macro avg       0.53      0.95      0.55     85443
    weighted avg       1.00      0.98      0.99     85443

    Wow, accuracy has dropped to 98% compared to the previous model, but the recall of the minority class has improved to 92%. That makes this a much better model than the previous one.

    NearMiss Algorithm:

    print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
    print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0)))
      
    # apply near miss
    from imblearn.under_sampling import NearMiss
    nr = NearMiss()
      
    X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel())
      
    print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
    print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))
      
    print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
    print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

    Output:

    Before Undersampling, counts of label '1': [345]
    Before Undersampling, counts of label '0': [199019] 

    After Undersampling, the shape of train_X: (690, 29)
    After Undersampling, the shape of train_y: (690,) 

    After Undersampling, counts of label '1': 345
    After Undersampling, counts of label '0': 345

    The NearMiss algorithm has undersampled the majority instances so that both classes now have an equal number of records: the majority class has been reduced to the size of the minority class.

    Prediction and Recall

    # train the model on train set
    lr2 = LogisticRegression()
    lr2.fit(X_train_miss, y_train_miss.ravel())
    predictions = lr2.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      0.56      0.72     85296
               1       0.00      0.95      0.01       147

        accuracy                           0.56     85443
       macro avg       0.50      0.75      0.36     85443
    weighted avg       1.00      0.56      0.72     85443
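    To compare the three approaches side by side, one can compute the fraud-class recall of each fitted model on the same test set (a minimal sketch reusing lr, lr1 and lr2 from above):

    from sklearn.metrics import recall_score

    # fraud-class recall for the baseline, SMOTE and NearMiss models
    for name, model in [('Baseline', lr), ('SMOTE', lr1), ('NearMiss', lr2)]:
        preds = model.predict(X_test)
        print(name, "fraud recall:", round(recall_score(y_test, preds), 2))

    From the reports above, SMOTE keeps overall accuracy high (98%) while raising fraud recall to 92%, whereas NearMiss pushes recall to 95% at the cost of accuracy dropping to 56%.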