ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python
- SMOTE
- Near Miss Algorithm
SMOTE (Synthetic Minority Oversampling Technique) – Oversampling
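The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The snippet below is a simplified, standard-library illustration of that interpolation step only; the function name `smote_sketch`, its parameters and the toy points are assumptions made for this example, and it is not the imblearn implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration only; not the imblearn implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # pick a random minority sample as the starting point
        base = rng.choice(minority)
        # find its k nearest minority neighbours (excluding the sample itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # place the new point at a random position on the segment base -> neighbour
        lam = rng.random()
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a blend of two existing minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.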
NearMiss Algorithm – Undersampling
- NearMiss – Version 1 : It selects the majority-class samples whose average distance to their k closest minority-class instances is smallest.
- NearMiss – Version 2 : It selects the majority-class samples whose average distance to their k farthest minority-class instances is smallest.
- NearMiss – Version 3 : It works in two steps. First, for each minority-class instance, its M nearest neighbors are stored. Then the majority-class instances whose average distance to their N nearest neighbors is largest are selected.
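As a rough illustration of the Version 1 rule, the sketch below ranks majority points by their mean distance to the k closest minority points and keeps the closest ones. It uses only the standard library; the function name `near_miss_1`, the parameter `n_keep` and the toy coordinates are assumptions made for this example and do not mirror imblearn's internals.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k closest minority points (NearMiss Version 1 rule).

    Hypothetical helper for illustration only.
    """
    def mean_dist_to_k_closest(point):
        closest = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(closest) / len(closest)

    return sorted(majority, key=mean_dist_to_k_closest)[:n_keep]


# toy data with two features (made-up values)
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (2.0, 2.0)]

# undersample the majority class down to the size of the minority class
kept = near_miss_1(majority, minority, n_keep=len(minority))
print(kept)  # [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]
```

The majority points far from the minority class, (5.0, 5.0) and (9.0, 9.0), are the ones discarded. In imblearn the variant is chosen through the `version` argument of `NearMiss`.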
Load libraries and data file
# import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

# load the data set
data = pd.read_csv('creditcard.csv')

# print info about columns in the dataframe
# (info() prints directly, so it does not need to be wrapped in print())
data.info()
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
# normalise the Amount column
data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))

# drop Time, which is not used for prediction, and the raw Amount column,
# which is replaced by normAmount
data = data.drop(['Time', 'Amount'], axis=1)

# class counts: only 492 of the transactions are fraudulent
print(data['Class'].value_counts())
0    284315
1       492
Split the data into test and train sets
from sklearn.model_selection import train_test_split

# separate features and label as NumPy arrays;
# y is kept as a single-column 2-D array, which is why .ravel() is used later
X = data.drop(columns=['Class']).to_numpy()
y = data[['Class']].to_numpy()

# split into a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# describe the train and test sets
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
Number transactions X_train dataset:  (199364, 29)
Number transactions y_train dataset:  (199364, 1)
Number transactions X_test dataset:  (85443, 29)
Number transactions y_test dataset:  (85443, 1)
Now train the model without handling the imbalanced class distribution
# logistic regression object
lr = LogisticRegression()

# train the model on train set
lr.fit(X_train, y_train.ravel())

predictions = lr.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.88      0.62      0.73       147

    accuracy                           1.00     85443
   macro avg       0.94      0.81      0.86     85443
weighted avg       1.00      1.00      1.00     85443
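The report above shows why accuracy alone is a weak yardstick on this data: accuracy rounds to 1.00 even though recall for class 1 is only 0.62. A small arithmetic check, using the support counts from the report and a hypothetical "model" that labels every transaction as class 0, makes the point:

```python
# support values taken from the classification report above
n_legit = 85296   # class 0 rows in the test set
n_fraud = 147     # class 1 rows in the test set

# hypothetical baseline that predicts class 0 for every transaction
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_fraud_recall = 0 / n_fraud  # it never flags a single fraud

print(f"accuracy:     {baseline_accuracy:.4f}")
print(f"fraud recall: {baseline_fraud_recall:.2f}")
```

Such a baseline is right on more than 99.8% of the test rows while catching no fraud at all, so recall and precision on the minority class are the numbers to watch.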
Using SMOTE Algorithm
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# import SMOTE from the imblearn library
# pip install imbalanced-learn (if you don't have imblearn on your system)
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=2)
# fit_resample replaces fit_sample, which was removed from recent imblearn releases
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))
Before OverSampling, counts of label '1': [345]
Before OverSampling, counts of label '0': [199019]

After OverSampling, the shape of train_X: (398038, 29)
After OverSampling, the shape of train_y: (398038,)

After OverSampling, counts of label '1': 199019
After OverSampling, counts of label '0': 199019
Prediction and Recall
# train the model on the oversampled train set
lr1 = LogisticRegression()
lr1.fit(X_train_res, y_train_res.ravel())

predictions = lr1.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85296
           1       0.06      0.92      0.11       147

    accuracy                           0.98     85443
   macro avg       0.53      0.95      0.55     85443
weighted avg       1.00      0.98      0.99     85443
NearMiss Algorithm:
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0)))

# apply near miss
from imblearn.under_sampling import NearMiss

nr = NearMiss()
# fit_resample replaces fit_sample, which was removed from recent imblearn releases
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel())

print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))

print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))
Before Undersampling, counts of label '1': [345]
Before Undersampling, counts of label '0': [199019]

After Undersampling, the shape of train_X: (690, 29)
After Undersampling, the shape of train_y: (690,)

After Undersampling, counts of label '1': 345
After Undersampling, counts of label '0': 345
Prediction and Recall
# train the model on the undersampled train set
lr2 = LogisticRegression()
lr2.fit(X_train_miss, y_train_miss.ravel())

predictions = lr2.predict(X_test)

# print classification report
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       1.00      0.56      0.72     85296
           1       0.00      0.95      0.01       147

    accuracy                           0.56     85443
   macro avg       0.50      0.75      0.36     85443
weighted avg       1.00      0.56      0.72     85443
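The f1-score column in these reports is the harmonic mean of precision and recall, which is why it falls sharply once precision for class 1 drops even though recall rises. The sketch below recomputes it from the rounded class 1 precision and recall printed in the first two reports; the helper name `f1` and the dictionary are assumptions made for this example, and because the inputs are already rounded the results are only approximate. The NearMiss run is left out because its precision is printed as 0.00, which is too coarse to recover the f1-score from.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for class 1, copied from the reports in this article
reports = {
    "no resampling": (0.88, 0.62),
    "SMOTE": (0.06, 0.92),
}

for name, (p, r) in reports.items():
    print(f"{name}: f1 = {f1(p, r):.2f}")
```

Resampling raised fraud recall from 0.62 to above 0.9 in both cases, but at the cost of many more false alarms, so the choice between the three models depends on how expensive a missed fraud is compared with a wrongly flagged transaction.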