Detecting and Handling Outliers: Implementation

Step 1: Import the necessary libraries and load the dataset

Python3
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

Step 2: Introducing outliers randomly to the dataset

In this step, we introduce outliers into the dataset at random, a common way of simulating contaminated data when demonstrating outlier detection techniques. Ten row indices are chosen at random and large random noise is added to those rows, mimicking the presence of outliers in real-world datasets.

Python3
np.random.seed(0)
# Pick 10 random rows and shift them by large random noise so that they act as outliers
outlier_indices = np.random.choice(range(len(X)), size=10, replace=False)
X[outlier_indices] += 50 * np.random.rand(10, 4)
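
In this tutorial the outlier positions are known because we injected them ourselves, but it is worth confirming that the corrupted rows actually stand out before deciding how to treat them. The snippet below is a minimal detection sketch, assuming the common z-score rule of thumb with a threshold of 3; it is illustrative only and is not used by the deletion pipeline that follows.

Python3
# Flag rows where any feature lies more than 3 standard deviations from the column mean
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
detected = np.where((z_scores > 3).any(axis=1))[0]
print("Rows flagged as potential outliers:", detected)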

Step 3: Creating datasets with specified outlier treatment

  • Single-case deletion approach: In this approach, we remove the outliers one at a time from the dataset. We define a function that creates copies of the original X and y arrays and iteratively deletes the rows listed in outlier_indices (a note on deleting rows by index follows the code below).
  • Multiple-case deletion approach: In this approach, each flagged outlier together with the two rows that immediately follow it is removed from the dataset in a single pass. This helps ensure that data points adjacent to the outliers are not used while training the model, giving a cleaner dataset that can potentially improve model performance.
Python3
def create_dataset(X, y, outlier_treatment, outlier_indices):
    # Single-case deletion: remove the flagged rows one at a time
    if outlier_treatment == "single":
        X_no_outliers = np.copy(X)
        y_no_outliers = np.copy(y)
        for idx in outlier_indices:
            X_no_outliers = np.delete(X_no_outliers, idx, axis=0)
            y_no_outliers = np.delete(y_no_outliers, idx)
        X_train, X_test, y_train, y_test = train_test_split(
            X_no_outliers, y_no_outliers, test_size=0.2, random_state=42)
    # Multiple-case deletion: remove each flagged row and the two rows that follow it in one pass
    elif outlier_treatment == "multiple":
        outlier_indices = np.concatenate(
            (outlier_indices, outlier_indices + 1, outlier_indices + 2))
        X_no_outliers = np.delete(X, outlier_indices, axis=0)
        y_no_outliers = np.delete(y, outlier_indices)
        # Split into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X_no_outliers, y_no_outliers, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test
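
One caveat about the single-case loop above: np.delete returns a new, shorter array on every call, so the positions of the remaining rows shift and later indices may no longer point at the rows that were originally flagged. A common safeguard, sketched here purely for illustration (remove_rows_safely is a hypothetical helper, not part of this tutorial's pipeline), is to delete the indices in descending order so that earlier indices stay valid:

Python3
# Deleting from the highest index downward keeps the remaining indices valid,
# so every removed row is the one that was originally flagged
def remove_rows_safely(X, y, indices):
    X_clean, y_clean = np.copy(X), np.copy(y)
    for idx in sorted(indices, reverse=True):
        X_clean = np.delete(X_clean, idx, axis=0)
        y_clean = np.delete(y_clean, idx)
    return X_clean, y_clean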

Step 4: Training the logistic regression model

We define a helper function that fits a logistic regression model on the training split and returns its accuracy on the test split.

Python3
def train_logistic_regression(X_train, X_test, y_train, y_test):
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return acc

Step 5: Evaluation

We now call the create_dataset function with both the "single" and "multiple" outlier treatments. Each call returns training and testing splits in which the outliers have been handled according to the chosen approach, and we train and evaluate a logistic regression model on each split.

Python3
X_train_single, X_test_single, y_train_single, y_test_single = create_dataset(X, y, "single", outlier_indices)
acc_single = train_logistic_regression(X_train_single, X_test_single, y_train_single, y_test_single)
X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple = create_dataset(X, y, "multiple", outlier_indices)
acc_multiple = train_logistic_regression(X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple)
print("Accuracy using single case deletion approach:", acc_single)
print("Accuracy using multiple case deletion approach:", acc_multiple)

Output:

Accuracy using single case deletion approach: 0.7857142857142857
Accuracy using multiple case deletion approach: 0.9583333333333334
  • Single-case deletion approach: This approach removes the outliers one by one from the dataset, and only after all of them have been removed is the model trained on the modified data. Here it yields an accuracy of about 0.7857, indicating that the model's performance is relatively lower when the outliers are handled individually.
  • Multiple-case deletion approach: This approach removes the outliers in groups, each flagged row together with its following neighbors. Here it yields an accuracy of about 0.9583, indicating that the model's performance improves markedly when the outliers are removed in groups.

Thus, on this dataset the multiple-case deletion approach handles the outliers more effectively than the single-case deletion approach, as it leads to a higher accuracy for the trained model.
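
For additional context, it can help to see how the model behaves when the outliers receive no treatment at all. The snippet below is an optional sanity check rather than part of the original walkthrough; it reuses the helpers defined above on the contaminated X and y without deleting anything, so the resulting accuracy can be compared against the two deletion approaches.

Python3
# Baseline: no outlier treatment, train directly on the contaminated data
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X, y, test_size=0.2, random_state=42)
acc_raw = train_logistic_regression(X_train_raw, X_test_raw, y_train_raw, y_test_raw)
print("Accuracy with outliers left in the data:", acc_raw)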
