Metrics for Over-fitting Detection

Over-fitting is a common problem in which a model performs well on the training data but poorly on unseen data. CatBoost provides metrics and tools to assess over-fitting.

1. Cross-Validation

Cross-validation is a crucial technique for detecting overfitting. The data is split into several folds, and the model is trained and evaluated on each one, so training and validation performance can be compared. A large gap between the two indicates overfitting.

Python3

from catboost import Pool, cv
from sklearn.datasets import load_iris
 
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
 
# Create a CatBoost Pool object
data_pool = Pool(X, label=y)
 
# Specify the CatBoostClassifier parameters
params = {
    'iterations': 100,           # Number of boosting iterations
    'learning_rate': 0.1,       # Learning rate
    'depth': 6,                 # Depth of the trees
    'loss_function': 'MultiClass',  # Loss function for multi-class classification
    'verbose': 0                # Set verbose to 0 for less output
}
 
# Perform cross-validation
cv_results = cv(pool=data_pool, 
                params=params, 
                fold_count=5, 
                shuffle=True, 
                partition_random_seed=42, 
                verbose_eval=False)
 
# Print the final-iteration test metrics (mean and std across folds)
for metric_name in cv_results.columns:
    if 'test-' in metric_name:
        final_value = cv_results[metric_name].iloc[-1]
        print(f'{metric_name}: {final_value:.4f}')

Output:

Training on fold [0/5]

bestTest = 0.1226007055
bestIteration = 72

Training on fold [1/5]

bestTest = 0.09388296402
bestIteration = 99

Training on fold [2/5]

bestTest = 0.05707644554
bestIteration = 99

Training on fold [3/5]

bestTest = 0.1341533772
bestIteration = 93

Training on fold [4/5]

bestTest = 0.19934632
bestIteration = 94

test-MultiClass-mean: 0.1221
test-MultiClass-std: 0.0531

This code runs 5-fold cross-validation for a CatBoost multi-class model on the Iris dataset and prints the mean and standard deviation of the test MultiClass loss at the final iteration. Cross-validation gives a more robust estimate of a model’s performance than a single train/test split. A test loss that is much higher than the training loss, or that varies widely across folds, suggests overfitting.
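The comparison between training and validation loss can be sketched in a few lines. The helper below is a hypothetical illustration and not part of CatBoost: summarize_cv_curves is an assumed name, and the two loss curves are made-up numbers chosen only to show the pattern of a falling training loss and a rising test loss.

```python
def summarize_cv_curves(train_loss, test_loss):
    """Compare per-iteration train and test loss curves (lower is better)."""
    if not train_loss or len(train_loss) != len(test_loss):
        raise ValueError("curves must be non-empty and of equal length")
    # Iteration at which the test loss is lowest
    best_iter = min(range(len(test_loss)), key=test_loss.__getitem__)
    return {
        "best_iteration": best_iter,
        "best_test_loss": test_loss[best_iter],
        # Gap between test and train loss at the last iteration
        "final_gap": test_loss[-1] - train_loss[-1],
        # True when the test loss got worse after its best point
        "test_loss_rose_after_best": test_loss[-1] > test_loss[best_iter],
    }


# Made-up curves for illustration only (not real CatBoost output)
train = [1.00, 0.60, 0.35, 0.20, 0.10, 0.05]
test = [1.02, 0.66, 0.45, 0.40, 0.43, 0.48]

summary = summarize_cv_curves(train, test)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With the cross-validation result above, the same helper could presumably be fed cv_results['train-MultiClass-mean'].tolist() and cv_results['test-MultiClass-mean'].tolist(). The train column name is an assumption based on the naming pattern of the test columns shown in the output.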

2. Feature Importance

CatBoost also provides feature importance scores, which rank features by their influence on the model’s predictions. Importance is not a direct measure of overfitting, but it can point to possible causes. A model that relies heavily on a feature with no plausible link to the target (an identifier column, for example) may be memorizing the training data. Features with near-zero importance are candidates for removal, which simplifies the model and can reduce the risk of overfitting.

Python3

import matplotlib.pyplot as plt
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
 
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
 
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)
 
# Create a CatBoostClassifier
model = CatBoostClassifier(iterations=100, 
                           learning_rate=0.1, 
                           depth=6, 
                           loss_function='MultiClass', 
                           verbose=0)
 
# Train the model
model.fit(X_train, y_train)
 
# Create a Pool object for the testing data
test_pool = Pool(X_test)
 
# Get feature importance scores
feature_importance = model.get_feature_importance(test_pool)
 
# Get feature names
feature_names = iris.feature_names
 
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_importance)), 
         feature_importance, 
         tick_label=feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance for CatBoost Classifier')
plt.show()

Output:

[Figure: bar plot of CatBoost feature importance for the four Iris features]

To calculate feature importance, we create a Pool object for the testing data with Pool(X_test) and pass it to model.get_feature_importance(), which returns one importance score per feature. Finally, a bar plot is created to visualize the scores.

The resulting bar plot will show the importance of each feature in the model’s predictions. This information can help you identify which features are most relevant to the classification task and guide feature selection or engineering efforts.

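A learning curve is also important in detecting overfitting: if the training loss keeps falling while the validation loss levels off or rises, the model is overfitting. In Jupyter notebooks, CatBoost can draw these curves interactively when plot=True is passed to fit(). Elsewhere, the curves can be plotted with other Python libraries such as matplotlib.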

CatBoost Metrics for model evaluation

To make sure our model’s performance satisfies evolving expectations and criteria, proper evaluation is crucial when it comes to machine learning model construction. Yandex’s CatBoost is a potent gradient-boosting library that gives machine learning practitioners and data scientists a toolbox of measures for evaluating model performance.

