Python Implementation of Resampling Techniques

We will apply both undersampling and oversampling to our dataset to balance the classes of the target variable.

Step 1: Import Libraries

Python3

import pandas as pd 
import numpy as np 
import seaborn as sns 
from sklearn.preprocessing import StandardScaler 
from imblearn.under_sampling import RandomUnderSampler, TomekLinks 
from imblearn.over_sampling import RandomOverSampler, SMOTE

Step 2: Reading the Dataset

We will read the dataset using the pandas read_csv function. Also, we will see the percentage of each class in the target variable of the dataset.

Python3

dataset = pd.read_csv('creditcard.csv') 
  
  
print("The Number of Samples in the dataset: ", len(dataset)) 
print('Class 0        :', round(dataset['Class'].value_counts()[0] 
                                / len(dataset) * 100, 2), '% of the dataset') 
  
print('Class 1(Fraud) :', round(dataset['Class'].value_counts()[1] 
                                / len(dataset) * 100, 2), '% of the dataset')

Output:

The Number of Samples in the dataset:  284807
Class 0        : 99.83 % of the dataset
Class 1(Fraud) : 0.17 % of the dataset
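The percentage calculation above is repeated in every later step, so it can help to see it in isolation. The following is a minimal sketch in plain Python; the helper name `class_percentages` and the toy labels are made up for illustration and are not part of the original code.

```python
from collections import Counter


def class_percentages(labels):
    """Return {class_label: percentage of all samples}, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(count / total * 100, 2) for label, count in counts.items()}


# Toy labels (hypothetical): 998 legitimate transactions and 2 fraudulent ones.
toy_labels = [0] * 998 + [1] * 2
shares = class_percentages(toy_labels)
print(shares)
```

On the real dataset, the same numbers could be obtained by passing `dataset['Class']` to the helper.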

Step 3: Undersampling the Majority Class

We will undersample the majority class, which contains the non-fraudulent transactions. This technique reduces the number of rows that belong to the majority class until it matches the size of the minority class.

Python3

X_data = dataset.iloc[:, :-1]   # features: every column except the last
Y_data = dataset.iloc[:, -1:]   # target: the last column ('Class')

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

# Count labels on the single target column so indexing by class label returns a scalar
class_counts = Y_res.iloc[:, 0].value_counts()

print("After Under Sampling Of Major Class Total Samples are :", len(Y_res))
print('Class 0        :', round(class_counts[0] / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(class_counts[1] / len(Y_res) * 100, 2), '% of the dataset')

Output:

After Under Sampling Of Major Class Total Samples are : 984
Class 0        : 50.0 % of the dataset
Class 1(Fraud) : 50.0 % of the dataset

We can see that after undersampling, the dataset has been reduced to 984 samples in total: the majority class has been cut down to 492 rows to match the 492 fraud cases.
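To make the mechanics concrete, here is a minimal sketch of random undersampling in plain Python. It is an illustration under simplifying assumptions, not imblearn's implementation; the function name `random_undersample` and the toy data are hypothetical.

```python
import random


def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is reduced to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        kept = rng.sample(group, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 10 majority rows and 2 minority rows.
toy_rows = [[float(i)] for i in range(12)]
toy_labels = [0] * 10 + [1] * 2

under_rows, under_labels = random_undersample(toy_rows, toy_labels)
print(len(under_rows), under_labels.count(0), under_labels.count(1))
```

The minority class is kept whole, and only majority rows are discarded. This is why the resampled credit card data contains exactly twice the number of fraud cases.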

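Step 4: Undersampling Using Tomek Links

We can also undersample using the TomekLinks class from imblearn. A Tomek link is a pair of samples from different classes that are each other's nearest neighbors; the majority-class sample of each such pair is removed.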

Python3

tl = TomekLinks()

X_res, y_res = tl.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

# Count labels on the single target column so indexing by class label returns a scalar
class_counts = Y_res.iloc[:, 0].value_counts()

print("After TomekLinks Under Sampling Of Major Class Total Samples are :", len(Y_res))
print('Class 0        :', round(class_counts[0] / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(class_counts[1] / len(Y_res) * 100, 2), '% of the dataset')

Output:

After TomekLinks Under Sampling Of Major Class Total Samples are : 284736
Class 0        : 99.83 % of the dataset
Class 1(Fraud) : 0.17 % of the dataset
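The output above shrinks by only 71 rows (from 284807 to 284736) because Tomek links remove just the majority samples that sit right next to a minority sample, rather than balancing the classes. The brute-force sketch below illustrates the idea on toy data; `find_tomek_links` and the points are hypothetical, and a real implementation would use a proper nearest-neighbor search.

```python
import math


def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    def nearest(i):
        # Index of the closest other point (brute force, Euclidean distance).
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy 1-D data (hypothetical): one majority point sits right next to a minority point.
toy_points = [[0.0], [1.0], [2.5], [4.9], [5.0], [9.0]]
toy_labels = [0, 0, 0, 0, 1, 1]

links = find_tomek_links(toy_points, toy_labels)

# Drop the majority-class member (label 0 here) of every link.
majority = 0
drop = {i if toy_labels[i] == majority else j for i, j in links}
kept = [idx for idx in range(len(toy_points)) if idx not in drop]
print(links, kept)
```

Only the borderline majority point at 4.9 is removed, and the rest of the majority class is left untouched.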

Step 5: Oversampling Using RandomOverSampler

We can use RandomOverSampler to oversample the minority class. Random oversampling picks existing minority-class data points at random and duplicates them until the classes are balanced.

Python3

ros = RandomOverSampler(random_state=42)

X_res, y_res = ros.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

# Count labels on the single target column so indexing by class label returns a scalar
class_counts = Y_res.iloc[:, 0].value_counts()

print("After Over Sampling Of Minor Class Total Samples are :", len(Y_res))
print('Class 0        :', round(class_counts[0] / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(class_counts[1] / len(Y_res) * 100, 2), '% of the dataset')

Output:

After Over Sampling Of Minor Class Total Samples are : 568630
Class 0        : 50.0 % of the dataset
Class 1(Fraud) : 50.0 % of the dataset
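The total of 568630 is twice the size of the majority class (2 × 284315), since the minority class is grown to the same size. A minimal plain-Python sketch of this duplication follows; it is an illustration, not imblearn's implementation, and `random_oversample` and the toy data are hypothetical.

```python
import random


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of each smaller class until
    every class is as large as the biggest one."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement from the existing rows of this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 10 majority rows and 2 minority rows.
toy_rows = [[float(i)] for i in range(12)]
toy_labels = [0] * 10 + [1] * 2

over_rows, over_labels = random_oversample(toy_rows, toy_labels)
minority_rows = [r for r, l in zip(over_rows, over_labels) if l == 1]
print(len(over_rows), len(minority_rows))
```

Every added minority row is an exact copy of an existing one, so no new information enters the dataset.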

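Step 6: Oversampling Using SMOTE

We can use SMOTE (Synthetic Minority Oversampling Technique) to oversample the minority class. Unlike random oversampling, which duplicates existing rows, SMOTE creates new synthetic data points for the minority class by interpolating between a minority sample and one of its nearest minority neighbors.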

Python3

sm = SMOTE(random_state=42)

X_res, y_res = sm.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

# Count labels on the single target column so indexing by class label returns a scalar
class_counts = Y_res.iloc[:, 0].value_counts()

print("After SMOTE Over Sampling Of Minor Class Total Samples are :", len(Y_res))
print('Class 0        :', round(class_counts[0] / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(class_counts[1] / len(Y_res) * 100, 2), '% of the dataset')

Output:

After SMOTE Over Sampling Of Minor Class Total Samples are : 568630
Class 0        : 50.0 % of the dataset
Class 1(Fraud) : 50.0 % of the dataset
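The core of SMOTE is a simple interpolation step: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. The sketch below shows only that step on toy data. `smote_sketch` and its parameters are hypothetical, and the library version handles details (such as how many points to create per sample) that are omitted here.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical): the four corners of a unit square.
toy_minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(toy_minority, n_new=3)
print(new_points)
```

Because each synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies instead of being exact duplicates.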

Introduction to Resampling methods

While reading about Machine Learning and Data Science, we often come across the term Imbalanced Class Distribution, which describes a dataset where one class has far more or far fewer observations than the others.
Because Machine Learning algorithms typically maximize overall accuracy by reducing the error, they do not take the class distribution into account and tend to favor the majority class. This problem is prevalent in applications such as Fraud Detection, Anomaly Detection, Facial Recognition, etc.
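A short calculation shows why accuracy alone is misleading on imbalanced data. The sketch below uses the sample counts of the credit card dataset from the implementation section (284807 rows, of which 492 are fraud, i.e. half of the 984 rows left after balanced undersampling) and evaluates a hypothetical "model" that always predicts the majority class.

```python
total_samples = 284807
fraud_samples = 492  # half of the 984 rows left after balanced undersampling
legit_samples = total_samples - fraud_samples

# A trivial "model" that labels every transaction as legitimate (class 0):
# it gets every legitimate row right and every fraud row wrong.
true_negatives = legit_samples
true_positives = 0

accuracy = (true_positives + true_negatives) / total_samples
fraud_recall = true_positives / fraud_samples

print(f"accuracy:     {accuracy:.2%}")
print(f"fraud recall: {fraud_recall:.2%}")
```

The accuracy comes out at about 99.83% even though not a single fraud case is detected, which is the motivation for resampling the data before training.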
