Step-by-step implementation of One-Class Support Vector Machines in Python
Importing required modules
First, we will import all the required Python libraries, such as Pandas, NumPy, Matplotlib, and scikit-learn.
Python3
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
Dataset loading and preprocessing
Now we will load the well-known credit card fraud dataset. For faster execution we will use only the first 50k rows. We will then use StandardScaler to scale the feature columns (every column except the target 'Class'), and finally separate the features and target variable for later use.
Python3
# Load the first 50k rows of the Kaggle credit card fraud dataset
# https://www.kaggle.com/mlg-ulb/creditcardfraud
credit_data = pd.read_csv('creditcard.csv', nrows=50000)

# Standardize every feature column (all columns except 'Class')
standardized_data_without_class = StandardScaler().fit_transform(
    credit_data.loc[:, credit_data.columns != 'Class'])
data_50k_new = standardized_data_without_class[0:50000]
data_50k_df = pd.DataFrame(data=data_50k_new)

# Separate features and target variable
X = credit_data.drop(columns=['Class'])
y = credit_data['Class']
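The credit card fraud dataset is highly imbalanced, which is precisely why an anomaly-detection approach like One-Class SVM is a natural fit. As a quick sanity check (a small sketch using the y defined above, not part of the original tutorial), we can inspect the class distribution:
Python3
# Class distribution of the target: 0 = normal transaction, 1 = fraud
print(y.value_counts())
print("Fraction of fraudulent transactions:", y.mean())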
Model training
Now we will train the One-Class SVM. Its main hyperparameters are discussed below:
kernel: The choice of kernel determines the transformation applied to the input data in a higher-dimensional space. Here we keep the default "rbf", which stands for Radial Basis Function and is commonly known as the Gaussian kernel. This kernel is suitable for capturing complex, non-linear relationships in the data.
degree: We keep the default value 3. It defines the degree of the polynomial function and applies only when the kernel is set to "poly"; it is ignored by all other kernels, including the "rbf" kernel used here.
gamma: A crucial parameter that influences the shape of the decision boundary. A smaller gamma value results in a broader, smoother decision boundary, making the model less sensitive to individual data points. Conversely, a larger gamma value leads to a more complex decision boundary that can capture intricate patterns in the data but may overfit. Fine-tuning gamma is essential for achieving optimal model performance.
nu: An upper bound on the fraction of margin errors (training points treated as outliers) and a lower bound on the fraction of support vectors. It allows users to control the trade-off between precision and recall. A smaller nu value makes the algorithm stricter, flagging fewer points as outliers, while a larger nu value is more lenient, permitting a higher fraction of margin errors and support vectors, which can be useful in scenarios with a considerable number of anomalies (see the sketch after this list).
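To build intuition for nu, here is a minimal, self-contained sketch on synthetic Gaussian data (the data, values, and variable names are illustrative and not part of the tutorial's dataset). The fraction of training points flagged as outliers stays close to, and is roughly bounded by, nu:
Python3
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical 2-D Gaussian data, for illustration only
rng = np.random.RandomState(42)
X_demo = rng.randn(1000, 2)

# The fraction of points predicted as outliers (-1) tracks nu
for nu in [0.01, 0.05, 0.2]:
    preds = OneClassSVM(kernel="rbf", gamma=0.1, nu=nu).fit_predict(X_demo)
    print("nu={}: fraction flagged as outliers = {:.3f}".format(
        nu, (preds == -1).mean()))
With the hyperparameters in hand, we now fit the model on the scaled credit card features: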
Python3
# Train a One-Class SVM on the scaled features and get its -1/1 predictions
clf_svm = OneClassSVM(kernel="rbf", degree=3, gamma=0.1, nu=0.01)
y_predict = clf_svm.fit_predict(data_50k_df)
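Besides the hard -1/1 labels from fit_predict, a fitted OneClassSVM also exposes a continuous anomaly score through its decision_function method (the signed distance to the learned boundary, negative for points outside it). A short sketch reusing the clf_svm fitted above:
Python3
# Signed distance to the separating boundary: negative scores fall
# outside the boundary and are therefore potential anomalies
scores = clf_svm.decision_function(data_50k_df)
print("Most anomalous score:", scores.min())
print("Points with negative score:", (scores < 0).sum())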
Model evaluation
Model evaluation here is a little different from that of traditional ML models. We can still compute accuracy, but it measures how well the model separates outliers (anomalies) from normal points rather than ordinary classification performance.
Python3
# OneClassSVM returns -1 for outliers and 1 for inliers; map these to
# 1 (anomaly) and 0 (normal) to match the dataset's 'Class' convention
svm_predict = pd.Series(y_predict).replace([-1, 1], [1, 0])
svm_anomalies = data_50k_df[svm_predict == 1]

# Calculate accuracy against the true labels
accuracy = accuracy_score(y, svm_predict)
print("Accuracy in separating Outlier:", accuracy)
Output:
Accuracy in separating Outlier: 0.9641
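Fraud cases are only a tiny fraction of this dataset, so a high accuracy can be achieved even by a weak detector. As a complementary check (a sketch using scikit-learn's standard metrics, not part of the original evaluation), precision and recall on the anomaly class are more informative:
Python3
from sklearn.metrics import classification_report

# Compare the mapped predictions (1 = anomaly) with the true labels
print(classification_report(y, svm_predict,
                            target_names=["normal", "fraud"]))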
Visualizing detected outliers (anomalies)
Now we will plot inliers and outliers for a pair of features. To do this we define a small helper function (plot_OCSVM) that plots feature i against feature i+1, highlighting the detected outliers. By changing the integer value passed to the function we can visualize different feature pairs.
Python3
def plot_OCSVM(i):
    # All points in red, with the detected anomalies overlaid in green
    plt.scatter(data_50k_df.iloc[:, i], data_50k_df.iloc[:, i + 1],
                c='red', s=40, edgecolor="k")
    plt.scatter(svm_anomalies.iloc[:, i], svm_anomalies.iloc[:, i + 1],
                c='green', s=40, edgecolor="k")
    plt.title("OC-SVM Outlier detection between Feature Pair: V{} and V{}".format(i, i + 1))
    plt.xlabel("V{}".format(i))
    plt.ylabel("V{}".format(i + 1))
    plt.show()

# plot_OCSVM(1)  # change the integer value to visualize different pairs of features
plot_OCSVM(2)
# plot_OCSVM(3)
Output:
From the above plot we can clearly see that the One-Class SVM has sharply separated the normal observations from the anomalies (potential outliers) for this pair of features. We can visualize other feature pairs by calling the function with different values.
Understanding One-Class Support Vector Machines
The Support Vector Machine is a popular supervised machine learning algorithm used for both classification and regression. In this article, we discussed the One-Class Support Vector Machine, an unsupervised variant that learns the boundary of the normal data and flags observations outside it as anomalies.