Implementing CatBoostClassifier with Various Parameters on the Iris Dataset

Import the Required Libraries

Python

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, classification_report
from catboost import CatBoostClassifier, Pool

Here we import numpy and pandas, the Iris loader and train/test splitting helper from scikit-learn, several classification metrics, and two classes from the catboost library:

CatBoostClassifier: The CatBoost library’s gradient boosting estimator for classification tasks. The name CatBoost is short for “categorical boosting.” The library is known for strong performance and ease of use, and it handles categorical features particularly well.
Pool: CatBoost’s data container, used to pass data efficiently for both training and evaluation. It supports custom feature names and categorical features and is designed to work with large datasets.

Load the Iris Dataset and Split it into Training and Testing Datasets

Python

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
 
# Convert the target to binary: class 0 (setosa) becomes 1, the other two classes become 0
y = (y == 0).astype(int)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This code first loads the Iris dataset, which consists of features (X) and target labels (y). It then converts the three-class target into a binary one: class 0 (setosa) is encoded as 1 and the other two classes are encoded as 0. Finally, it splits the data into training (80%) and testing (20%) sets so that the model can be evaluated on samples it has not seen.
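The binary target is imbalanced (one positive for every two negatives), and a purely random split does not guarantee that ratio in the test set. A hedged variation, not used in the rest of this article, is to pass stratify=y so that both splits keep the original class ratio. The variable names with the _s suffix are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int)  # 1 = setosa, 0 = versicolor or virginica

# np.bincount counts samples per class: index 0 is the negative class, index 1 the positive class
print("Full dataset:", np.bincount(y))

# stratify=y keeps the class proportions the same in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train:", np.bincount(y_train_s))
print("Test:", np.bincount(y_test_s))
```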

Create CatBoost Pools for efficient Data Handling

Python

# Create CatBoost Pools for efficient data handling
train_pool = Pool(data=X_train, label=y_train, cat_features=[], feature_names=iris.feature_names)
test_pool = Pool(data=X_test, label=y_test, cat_features=[], feature_names=iris.feature_names)

These lines wrap the training and testing data (X_train and X_test) in CatBoost Pools, together with the labels (y_train and y_test) and the feature names from the Iris dataset. The cat_features list is empty because all four Iris features are numeric. A Pool stores the data in CatBoost’s own internal format, which the library can use directly for training and evaluation.

Defining CatBoost Parameters

Python

# Define CatBoost parameters
params = {
    'iterations': 100,
    'depth': 6,
    'learning_rate': 0.1,
    'loss_function': 'Logloss',  # Classification task
    'custom_metric': ['Accuracy', 'AUC'],  # Additional metrics to track
    'verbose': 10,  # Print training progress every 10 iterations
    'random_seed': 42  # Set a random seed for reproducibility
}

These lines define the classifier’s settings: the number of boosting iterations, the depth of each tree, the learning rate, the loss function to optimize (Logloss), and the extra metrics to monitor during training (Accuracy and AUC). The verbose parameter controls how often progress is printed, and random_seed fixes the random seed so that repeated runs with the same data and settings are reproducible.

Train and Evaluate the CatBoost Model

Python

# Train the CatBoost classifier
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool)
 
# Make predictions on the test set
y_pred = model.predict(test_pool)
 
# Calculate evaluation metrics
y_proba = model.predict_proba(test_pool)[:, 1]  # predicted probability of the positive class
accuracy = accuracy_score(y_test, y_pred)
logloss = log_loss(y_test, y_proba)
roc_auc = roc_auc_score(y_test, y_proba)
 
# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss:.4f}")
print(f"AUC: {roc_auc:.4f}")

Output:

0:    learn: 0.6333595    test: 0.6326569    best: 0.6326569 (0)    total: 5.68ms    remaining: 563ms
10:    learn: 0.2973689    test: 0.2938201    best: 0.2938201 (10)    total: 9.87ms    remaining: 79.9ms
20:    learn: 0.1637735    test: 0.1591490    best: 0.1591490 (20)    total: 13.7ms    remaining: 51.5ms
30:    learn: 0.1051307    test: 0.1011177    best: 0.1011177 (30)    total: 17.7ms    remaining: 39.5ms
40:    learn: 0.0715529    test: 0.0695287    best: 0.0695287 (40)    total: 21.5ms    remaining: 31ms
50:    learn: 0.0533052    test: 0.0515575    best: 0.0515575 (50)    total: 25.1ms    remaining: 24.1ms
60:    learn: 0.0416665    test: 0.0404120    best: 0.0404120 (60)    total: 28.6ms    remaining: 18.3ms
70:    learn: 0.0342899    test: 0.0332187    best: 0.0332187 (70)    total: 33.8ms    remaining: 13.8ms
80:    learn: 0.0294652    test: 0.0286255    best: 0.0286255 (80)    total: 37.4ms    remaining: 8.78ms
90:    learn: 0.0256959    test: 0.0250120    best: 0.0250120 (90)    total: 41.2ms    remaining: 4.07ms
99:    learn: 0.0230690    test: 0.0225294    best: 0.0225294 (99)    total: 45.1ms    remaining: 0us
bestTest = 0.02252943945
bestIteration = 99
Accuracy: 1.0000
Log Loss: 0.0225
AUC: 1.0000

This code trains a CatBoost classifier with the parameters defined above and evaluates it on the test set. It computes three measures from the test predictions: accuracy, log loss, and area under the ROC curve (AUC). Accuracy is the share of correctly predicted labels, log loss measures the quality of the predicted probabilities, and AUC measures how well the model separates the two classes. The perfect accuracy and AUC here reflect how easily setosa is separated from the other two species, so they should not be read as typical results.
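To make the three metrics concrete, here is a small hand computation using only the standard library. The labels and probabilities are made up for illustration and are not taken from the model above. It also shows why the output can report perfect accuracy and AUC alongside a non-zero log loss: log loss penalizes any probability that is not exactly 0 or 1, even when every label is predicted correctly.

```python
import math

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 0]
p_pos = [0.9, 0.2, 0.6, 0.4]

# Accuracy: threshold the probabilities at 0.5 and count matching labels
y_hat = [1 if p >= 0.5 else 0 for p in p_pos]
accuracy = sum(int(t == h) for t, h in zip(y_true, y_hat)) / len(y_true)

# Log loss: average negative log of the probability assigned to the true class
log_loss_value = -sum(
    math.log(p) if t == 1 else math.log(1 - p) for t, p in zip(y_true, p_pos)
) / len(y_true)

# AUC: share of (positive, negative) pairs where the positive is scored higher; ties count half
pos = [p for t, p in zip(y_true, p_pos) if t == 1]
neg = [p for t, p in zip(y_true, p_pos) if t == 0]
auc = sum(
    1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
) / (len(pos) * len(neg))

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {log_loss_value:.4f}")
print(f"AUC: {auc:.4f}")
```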

Classification Report

Python

# Generate a classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)

Output:

Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        10
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

The classification_report function compares the predicted labels (y_pred) with the actual labels (y_test) and summarizes the model’s performance per class. The printed report lists precision, recall, F1-score, and support (the number of true samples) for each class, followed by the overall accuracy and the macro and weighted averages.
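The per-class numbers in such a report come from simple counts. The sketch below derives precision, recall, F1-score, and support for the positive class by hand, using made-up labels rather than the article’s test set so that the values are not all 1.00.

```python
# Hypothetical true and predicted labels (not the article's test set)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

# Counts for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = sum(1 for t in y_true if t == 1)          # number of actual positives

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```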

CatBoost Tree Parameters

CatBoost is a popular gradient-boosting library known for its effectiveness in machine-learning competitions. It is particularly well-suited for tabular data and has several parameters that can be tuned to improve model performance. In this article, we will focus on CatBoost’s tree-related parameters and explore how they influence the model’s behaviour.
