ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python

Imbalanced data distribution generally happens when observations in one of the classes are much higher or lower than in the other classes. Because machine learning algorithms try to increase accuracy by reducing error, they do not take the class distribution into account. This problem is common in areas such as Fraud Detection, Anomaly Detection and Facial recognition.
Standard ML techniques such as Decision Trees and Logistic Regression have a bias towards the majority class, and they tend to ignore the minority class, predicting mostly the majority label. As a result, a model trained on imbalanced data is prone to very poor recall on the minority class.
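As a quick illustration of why plain accuracy is misleading here, consider a hypothetical 99:1 class split and a "model" that always predicts the majority class (a minimal sketch with made-up numbers, not taken from the dataset used later):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# made-up labels: 990 genuine (0) and 10 fraud (1) samples
y_true = np.array([0] * 990 + [1] * 10)

# a trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))      # 0.99 -- looks excellent
print("Fraud recall:", recall_score(y_true, y_pred))    # 0.0  -- every fraud is missed

This is exactly the failure mode that the techniques below are meant to address.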
Imbalanced Data Handling Techniques:
  1. SMOTE
  2. Near Miss Algorithm

SMOTE (Synthetic Minority Oversampling Technique) – Oversampling

SMOTE is one of the most commonly used oversampling methods to solve the imbalance problem. Rather than simply duplicating minority examples, it generates virtual training records by linear interpolation between a minority-class sample and one of its k-nearest minority-class neighbours.
A closer look at how the SMOTE algorithm works:
  • Step 1: Set the minority class set A. For each $x \in A$, the k-nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other sample in A.
  • Step 2: The sampling rate N is set according to the imbalance proportion. For each $x \in A$, N examples ($x_1, x_2, \ldots, x_N$) are randomly selected from its k-nearest neighbors, and they construct the set $A_1$.
  • Step 3: For each example $x_k \in A_1$ (k = 1, 2, ..., N), the following formula is used to generate a new example: $x' = x + rand(0, 1) \times |x - x_k|$, in which $rand(0, 1)$ represents a random number between 0 and 1.
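A minimal NumPy sketch of the interpolation in Step 3, written in the usual SMOTE form x' = x + rand(0, 1) * (x_k - x), with made-up vectors for x and its neighbour x_k:

import numpy as np

# x: a minority-class sample; x_k: one of its k-nearest minority-class neighbours
# (both vectors are made up purely for illustration)
x = np.array([1.0, 2.0])
x_k = np.array([1.5, 2.6])

# the synthetic sample lies somewhere on the line segment between x and x_k
x_new = x + np.random.rand() * (x_k - x)
print(x_new)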


NearMiss Algorithm – Undersampling

NearMiss is an undersampling technique: it balances the class distribution by eliminating examples of the majority class. When instances of the two classes lie very close together, removing the nearby majority-class instances increases the separation between the classes and helps classification. To limit the information loss that most undersampling techniques suffer from, near-neighbor methods are widely used.
The basic intuition behind near-neighbor methods is as follows:
  • Step 1: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is to be under-sampled.
  • Step 2: Then, n instances of the majority class that have the smallest distances to those in the minority class are selected.
  • Step 3: If there are k instances in the minority class, the method will keep k*n instances of the majority class.
  • For finding the n closest instances in the majority class, there are several variations of the NearMiss algorithm (a usage sketch follows this list):
    1. NearMiss – Version 1: selects samples of the majority class for which the average distance to the k closest instances of the minority class is smallest.
    2. NearMiss – Version 2: selects samples of the majority class for which the average distance to the k farthest instances of the minority class is smallest.
    3. NearMiss – Version 3: works in two steps. First, for each minority-class instance, its M nearest neighbors are stored. Then, the majority-class instances whose average distance to their N nearest neighbors is the largest are selected.
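    With the imbalanced-learn library (used later in this article), the variant is chosen through the version parameter of NearMiss. A minimal sketch on made-up synthetic data:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss

    # made-up imbalanced data: roughly 90% majority class, 10% minority class
    X_toy, y_toy = make_classification(n_samples = 1000, weights = [0.9, 0.1], random_state = 0)
    print("Before undersampling:", Counter(y_toy))

    # pick the NearMiss variant through `version` (1, 2 or 3)
    nm = NearMiss(version = 1)
    X_res, y_res = nm.fit_resample(X_toy, y_toy)
    print("After undersampling: ", Counter(y_res))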
    The rest of this article is a hands-on walkthrough to help you choose between these imbalanced data handling techniques.

    Load libraries and data file

    The dataset consists of credit card transactions and contains only 492 fraud transactions out of 284,807 transactions, i.e. the positive (fraud) class accounts for about 0.172% of all records, which makes it highly imbalanced.
    # import necessary modules 
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import confusion_matrix, classification_report
      
    # load the data set
    data = pd.read_csv('creditcard.csv')
      
    # print info about columns in the dataframe
    print(data.info())

    Output:

    RangeIndex: 284807 entries, 0 to 284806
    Data columns (total 31 columns):
    Time           284807 non-null  float64
    V1 ... V28     284807 non-null  float64   (28 columns, each with the same count and dtype)
    Amount         284807 non-null  float64
    Class          284807 non-null  int64

    # normalise the amount column
    data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))
      
    # drop Time and Amount columns as they are not relevant for prediction purpose 
    data = data.drop(['Time', 'Amount'], axis = 1)
      
    # as you can see there are 492 fraud transactions.
    data['Class'].value_counts()

    Output:

    0    284315
    1       492
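    Since matplotlib was imported above, a quick bar chart makes the imbalance visually obvious (a minimal sketch):

    # visualise the class imbalance (0 = genuine, 1 = fraud)
    data['Class'].value_counts().plot(kind = 'bar', title = 'Class distribution')
    plt.show()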

    Split the data into test and train sets

    from sklearn.model_selection import train_test_split
      
    # define the feature matrix X (all columns except Class) and the target y (the Class column)
    X = data.loc[:, data.columns != 'Class'].values
    y = data.loc[:, data.columns == 'Class'].values
      
    # split into a 70:30 ratio
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
      
    # describes info about train and test set
    print("Number transactions X_train dataset: ", X_train.shape)
    print("Number transactions y_train dataset: ", y_train.shape)
    print("Number transactions X_test dataset: ", X_test.shape)
    print("Number transactions y_test dataset: ", y_test.shape)

    Output:

    Number transactions X_train dataset:  (199364, 29)
    Number transactions y_train dataset:  (199364, 1)
    Number transactions X_test dataset:  (85443, 29)
    Number transactions y_test dataset:  (85443, 1)

    Now train the model without handling the imbalanced class distribution

    # logistic regression object
    lr = LogisticRegression()
      
    # train the model on train set
    lr.fit(X_train, y_train.ravel())
      
    predictions = lr.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00     85296
               1       0.88      0.62      0.73       147

        accuracy                           1.00     85443
       macro avg       0.94      0.81      0.86     85443
    weighted avg       1.00      1.00      1.00     85443
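    Before interpreting the report, it helps to look at the raw error counts. confusion_matrix was imported above but not used so far; a minimal sketch of inspecting it for this baseline model:

    # confusion matrix of the baseline model: rows = actual class, columns = predicted class
    cm = confusion_matrix(y_test, predictions)
    print(cm)

    # recall of the fraud class = TP / (TP + FN)
    tn, fp, fn, tp = cm.ravel()
    print("Fraud recall: {:.2f}".format(tp / (tp + fn)))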

    The accuracy comes out to be almost 100%, but did you notice something strange? The recall of the minority class is very low, which shows that the model is heavily biased towards the majority class, so this is not a good model. Next, we apply the imbalanced data handling techniques and compare their accuracy and recall results.

    Using SMOTE Algorithm

    print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
    print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
      
    # import SMOTE module from imblearn library
    # pip install imblearn (if you don't have imblearn in your system)
    from imblearn.over_sampling import SMOTE
    sm = SMOTE(random_state = 2)
    X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
      
    print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
    print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
      
    print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
    print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

    Output:

    Before OverSampling, counts of label '1': [345]
    Before OverSampling, counts of label '0': [199019] 

    After OverSampling, the shape of train_X: (398038, 29)
    After OverSampling, the shape of train_y: (398038,) 

    After OverSampling, counts of label '1': 199019
    After OverSampling, counts of label '0': 199019

    Look! The SMOTE algorithm has oversampled the minority instances so that both classes now have an equal number of records: the minority class has been increased to the size of the majority class.

    Prediction and Recall

    lr1 = LogisticRegression()
    lr1.fit(X_train_res, y_train_res.ravel())
    predictions = lr1.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      0.98      0.99     85296
               1       0.06      0.92      0.11       147

        accuracy                           0.98     85443
       macro avg       0.53      0.95      0.55     85443
    weighted avg       1.00      0.98      0.99     85443

    Wow, accuracy has dropped to 98% compared to the previous model, but the recall of the minority class has improved to 92%. That makes this a much better model than the previous one.

    NearMiss Algorithm:

    print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
    print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0)))
      
    # apply near miss
    from imblearn.under_sampling import NearMiss
    nr = NearMiss()
      
    X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel())
      
    print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
    print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))
      
    print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
    print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

    Output:

    Before Undersampling, counts of label '1': [345]
    Before Undersampling, counts of label '0': [199019] 

    After Undersampling, the shape of train_X: (690, 29)
    After Undersampling, the shape of train_y: (690,) 

    After Undersampling, counts of label '1': 345
    After Undersampling, counts of label '0': 345

    The NearMiss algorithm has undersampled the majority instances so that both classes now have an equal number of records: the majority class has been reduced to the size of the minority class.

    Prediction and Recall

    # train the model on train set
    lr2 = LogisticRegression()
    lr2.fit(X_train_miss, y_train_miss.ravel())
    predictions = lr2.predict(X_test)
      
    # print classification report
    print(classification_report(y_test, predictions))

    Output:

                  precision    recall  f1-score   support

               0       1.00      0.56      0.72     85296
               1       0.00      0.95      0.01       147

        accuracy                           0.56     85443
       macro avg       0.50      0.75      0.36     85443
    weighted avg       1.00      0.56      0.72     85443
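    To compare the three approaches side by side, one can compute the fraud-class recall of each fitted model on the same test set (a minimal sketch reusing lr, lr1 and lr2 from above):

    from sklearn.metrics import recall_score

    # fraud-class recall for the baseline, SMOTE and NearMiss models
    for name, model in [('Baseline', lr), ('SMOTE', lr1), ('NearMiss', lr2)]:
        preds = model.predict(X_test)
        print(name, "fraud recall:", round(recall_score(y_test, preds), 2))

    From the reports above, SMOTE keeps overall accuracy high (98%) while raising fraud recall to 92%, whereas NearMiss pushes recall to 95% at the cost of accuracy dropping to 56%.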