Anomaly detection using Isolation Forest: Implementation
Let’s walk through an implementation of the Isolation Forest algorithm for anomaly detection using the Iris flower dataset from scikit-learn. In the context of the Iris dataset, outliers are data points that do not fit the pattern of any of the three known Iris species (Iris setosa, Iris versicolor, and Iris virginica). The implementation proceeds in the following steps:
Step 1: Import necessary libraries
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
Step 2: Loading and Splitting the Dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 3: Fitting the model
This code creates an Isolation Forest estimator using the IsolationForest class. The contamination parameter specifies the expected proportion of anomalies in the data; here it is set to 0.1 (10%).
# initialize and fit the model
clf = IsolationForest(contamination=0.1)
clf.fit(X_train)
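Under the hood, predict compares an anomaly score against a threshold derived from contamination. The following sketch (an addition to the tutorial, not part of the original code) uses scikit-learn's decision_function to inspect those scores directly:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test = train_test_split(iris.data, test_size=0.3, random_state=42)

# random_state is added here for reproducibility of the scores
clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train)

# decision_function = score_samples - offset_; negative values are anomalies.
scores = clf.decision_function(X_train)
print(scores[:5])

# contamination=0.1 sets the internal offset_ so that roughly 10% of the
# training scores fall below zero (and are thus labeled -1 by predict).
print((scores < 0).mean())
```

Examining the raw scores rather than the hard 1/-1 labels is useful when you want to rank points by how anomalous they are.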
Step 4: Predictions
The predict method returns a label for each data point, indicating whether the model classifies it as normal (1) or anomalous (-1).
# predict the anomalies in the data
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print(y_pred_train)
print(y_pred_test)
Output:
[ 1 1 1 1 -1 1 -1 1 1 -1 1 1 1 1 -1 1 1 1 1 1 1 1 -1 1
1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 -1 1 1 -1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1
1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 -1 1 1]
[ 1 -1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 -1 1 1 1 -1 1]
Step 5: Visualization
def create_scatter_plots(X1, y1, title1, X2, y2, title2):
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    # Scatter plot for the first set of data
    axes[0].scatter(X1[y1==1, 0], X1[y1==1, 1], color='green', label='Normal')
    axes[0].scatter(X1[y1==-1, 0], X1[y1==-1, 1], color='red', label='Anomaly')
    axes[0].set_title(title1)
    axes[0].legend()
    # Scatter plot for the second set of data
    axes[1].scatter(X2[y2==1, 0], X2[y2==1, 1], color='green', label='Normal')
    axes[1].scatter(X2[y2==-1, 0], X2[y2==-1, 1], color='red', label='Anomaly')
    axes[1].set_title(title2)
    axes[1].legend()
    plt.tight_layout()
    plt.show()
# scatter plots
create_scatter_plots(X_train, y_pred_train, 'Training Data', X_test, y_pred_test, 'Test Data')
Output:
The distribution of anomalies differs between the two splits: in the training data the anomalies tend to lie on the edges of the plot, while in the test data they are more scattered throughout.
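Note that the scatter plots above show only the first two of the four Iris features. As an alternative view (an addition not in the original tutorial), the full 4-D data can be projected onto two principal components with scikit-learn's PCA before plotting, so all features contribute to the picture:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test = train_test_split(iris.data, test_size=0.3, random_state=42)

clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)

# Project the 4-D training data onto its first two principal components.
X_2d = PCA(n_components=2).fit_transform(X_train)

plt.scatter(X_2d[y_pred_train == 1, 0], X_2d[y_pred_train == 1, 1],
            color='green', label='Normal')
plt.scatter(X_2d[y_pred_train == -1, 0], X_2d[y_pred_train == -1, 1],
            color='red', label='Anomaly')
plt.legend()
plt.title('Training data (PCA projection)')
plt.show()
```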
Conclusion
Anomaly detection is vital across industries, revealing outliers in data that signal problems or unique insights. As this tutorial has shown, Isolation Forests offer a powerful and easy-to-use solution, isolating anomalies from normal data and identifying outliers amid the multidimensional Iris measurements with just a few lines of scikit-learn code.