Implementation with Isolation Forest

In this section, we delve into the implementation of Isolation Forest by performing anomaly detection on credit card transactions, using the following steps:

Step 1: Importing required libraries

Python3
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

Step 2: Dataset loading and pre-processing

Now we will load the well-known Credit Card Fraud Detection dataset, limiting it to the first 40,000 rows for faster processing. We then standardize every feature except the target variable ‘Class’ using StandardScaler, ensuring that each feature has a mean of 0 and a standard deviation of 1, and wrap the standardized values in a DataFrame. Finally, we separate the features (X) from the target variable (y), where ‘X’ contains all columns except ‘Class’ and ‘y’ contains only the ‘Class’ column indicating each transaction’s fraud status.

Python3
# Load the first 40,000 rows of the dataset for faster processing
# Dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
credit_data = pd.read_csv('creditcard.csv', nrows=40000)
# Standardize all features except the target variable 'Class'
scaled_data = StandardScaler().fit_transform(credit_data.loc[:, credit_data.columns != 'Class'])
df = pd.DataFrame(data=scaled_data)
# Separate features and target variable
X = credit_data.drop(columns=['Class'])
y = credit_data['Class']
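
Before fitting the model, it is worth a quick sanity check on what we loaded. Here is a minimal sketch, assuming the standard Kaggle column layout (‘Time’, ‘V1’–‘V28’, ‘Amount’, ‘Class’):

Python3
# Quick sanity check on the loaded data
print(df.shape)                        # expected (40000, 30): every column except 'Class'
print(y.value_counts(normalize=True))  # fraud ('Class' == 1) is well under 1% of rows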

Step 3: Defining the Isolation Forest model

Now it is time to train our Isolation Forest model. First, the fraction of outliers in the dataset is estimated as the ratio of fraudulent transactions (‘Class’ equals 1) to non-fraudulent transactions (‘Class’ equals 0); strictly speaking this is a ratio of class counts rather than the overall fraud fraction, but for such a rare class the two are nearly identical. An Isolation Forest model is then created and fitted to the scaled data. Its hyperparameters are set as follows: ‘n_estimators’ is 100, the number of base estimators in the ensemble; ‘contamination’ is the previously calculated outlier fraction, representing the expected proportion of outliers in the dataset; and ‘random_state’ is fixed for reproducibility.

Python3
# Estimate the fraction of outliers as the ratio of fraud to non-fraud counts
outlier_fraction = len(credit_data[credit_data['Class'] == 1]) / len(credit_data[credit_data['Class'] == 0])
# Create and fit the Isolation Forest model on the scaled features
model = IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)
model.fit(df)

Output:

IsolationForest(contamination=0.0026067776218167233, random_state=42)
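
Once fitted, the model can score transactions it has not seen before as well. The following is a minimal sketch that reuses the first five scaled rows as stand-ins for new data:

Python3
# Stand-in for new, already-scaled transactions (here: the first 5 rows of df)
new_transactions = df.iloc[:5]
print(model.predict(new_transactions))            # 1 = inlier, -1 = outlier
print(model.decision_function(new_transactions))  # lower score = more anomalous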

Step 4: Model evaluation

Now we will evaluate how accurately our model separates the outliers, i.e. the potential anomalies, present in the dataset. We first compute the anomaly score with the model’s decision_function, then convert the predicted labels from Isolation Forest’s convention (1 for inliers, -1 for outliers) to the dataset’s convention (0 for normal, 1 for fraud) and print the accuracy.

Python3
# Compute anomaly scores (the lower the score, the more anomalous the point)
scores_prediction = model.decision_function(df)
# Predict labels: Isolation Forest returns 1 for inliers and -1 for outliers
y_pred = model.predict(df)
# Map to the dataset's convention: 0 = normal, 1 = anomaly
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
# Print the accuracy in separating outliers or anomalies
print("Accuracy in finding anomaly:", accuracy_score(y, y_pred))

Output:

Accuracy in finding anomaly: 0.997175

So, we have achieved above 99% accuracy. However, since fraudulent transactions make up well under 1% of the rows, a model that labels everything as normal would also score above 99%, so accuracy alone should be interpreted with care here.
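
Because the classes are this imbalanced, per-class precision and recall are more informative than plain accuracy. Here is a minimal sketch using the classification_report we imported in Step 1 (the target_names labels are our own choice):

Python3
# Per-class precision and recall; the 'fraud' row shows how well anomalies are caught
print(classification_report(y, y_pred, target_names=['normal', 'fraud']))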

Step 5: Comparative visualization

Now we will plot the normal vs. anomalous instances for a chosen feature of the dataset. Here we plot the ‘Amount’ feature, but you can simply change the feature name to visualize another feature’s results.

Python3
# Selecting the feature for y-axis
y_feature = credit_data['Amount']    # change the feature name to visualize another

# Adding the predicted labels to the original dataset
credit_data['predicted_class'] = y_pred

# Plotting the graph
plt.figure(figsize=(7, 4))
sns.scatterplot(x=credit_data.index, y=y_feature, hue=credit_data['predicted_class'], palette={0: 'blue', 1: 'red'}, s=50)
plt.title('Visualization of Normal vs Anomalous Transactions')
plt.xlabel('Data points')
plt.ylabel(y_feature.name)
plt.legend(title='Predicted Class', loc='best')
plt.show()

Output:

From the above plot, we can clearly see that the normal and anomalous instances are separated well, with very little overlap.
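
A complementary view is the distribution of the anomaly scores themselves. Below is a minimal sketch reusing scores_prediction from the evaluation step; points predicted as anomalies should sit in the low-score tail:

Python3
# Histogram of decision_function scores, split by predicted label
plt.figure(figsize=(7, 4))
plt.hist(scores_prediction[y_pred == 0], bins=50, alpha=0.6, label='predicted normal')
plt.hist(scores_prediction[y_pred == 1], bins=50, alpha=0.6, label='predicted anomaly')
plt.xlabel('decision_function score')
plt.ylabel('count')
plt.legend()
plt.title('Distribution of Isolation Forest Anomaly Scores')
plt.show()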
