Implementation with CatBoostClassifier Using Various Parameters on the Iris Dataset

Import the Required Libraries

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, classification_report
from catboost import CatBoostClassifier, Pool


This step imports NumPy and pandas, the scikit-learn utilities for loading the dataset, splitting it, and scoring predictions, and two classes from the catboost package:

CatBoostClassifier: CatBoost's gradient boosting estimator for classification tasks. The library's name is short for "categorical boosting," and it is known for strong performance, ease of use, and native handling of categorical features.
Pool: CatBoost's data container for training and evaluation. It bundles the feature matrix with labels, optional feature names, and the indices of categorical features, and it is designed to handle large datasets efficiently.

Load the Iris Dataset and Split it into Training and Testing Datasets

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
 
# Binarize the target: setosa (original class 0) becomes 1, the other two species become 0
y = (y == 0).astype(int)
 
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


This code first loads the Iris dataset, which consists of features (X) and target labels (y). It then converts the three-class target into a binary one: class 0 (setosa) is encoded as 1 and the other two classes as 0. Finally, the data is split into training (80%) and test (20%) sets so that the model can be evaluated on samples it has not seen.
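A quick check of the class balance makes the effect of the binarization concrete. The sketch below is an illustration, not part of the walkthrough above: it counts the two classes and passes stratify=y, an optional train_test_split argument that the original code does not use, so that both splits keep the same 2:1 ratio of "other" to "setosa".

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int)  # 1 = setosa, 0 = the other two species

# Count samples per class: index 0 is "other", index 1 is "setosa"
class_counts = np.bincount(y)
print("Full dataset:", class_counts)

# stratify=y is an addition for illustration; it preserves the class ratio in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train:", np.bincount(y_train))
print("Test: ", np.bincount(y_test))
```

Without stratify, the split is purely random, so the class ratio in a small test set can drift from the ratio in the full dataset.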

Create CatBoost Pools for Efficient Data Handling

# Create CatBoost Pools for efficient data handling
train_pool = Pool(data=X_train, label=y_train, cat_features=[], feature_names=iris.feature_names)
test_pool = Pool(data=X_test, label=y_test, cat_features=[], feature_names=iris.feature_names)


These lines wrap the training and test data in CatBoost Pool objects. Each Pool combines the feature matrix (X_train or X_test) with its labels (y_train or y_test) and the feature names from the Iris dataset. The cat_features list is empty because all four Iris features are numeric. CatBoost can then train on and evaluate these Pools directly.

Defining CatBoost Parameters

# Define CatBoost parameters
params = {
    'iterations': 100,
    'depth': 6,
    'learning_rate': 0.1,
    'loss_function': 'Logloss',  # Binary classification loss
    'custom_metric': ['Accuracy', 'AUC'],  # Additional metrics to track
    'verbose': 10,  # Print training progress every 10 iterations
    'random_seed': 42  # Set a random seed for reproducibility
}


These lines define the classifier's settings: the number of boosting iterations (iterations), the depth of each tree (depth), the learning rate, the loss function (Logloss, used for binary classification), and the additional metrics to track during training (Accuracy and AUC). The verbose parameter controls how often progress is printed, here every 10 iterations, and random_seed fixes the random seed so that results are reproducible.

Train and Evaluate the CatBoost Model

# Train the CatBoost classifier
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=test_pool)
 
# Make predictions on the test set
y_pred = model.predict(test_pool)
 
# Calculate evaluation metrics
y_proba = model.predict_proba(test_pool)[:, 1]  # Probability of the positive class
accuracy = accuracy_score(y_test, y_pred)
logloss = log_loss(y_test, y_proba)
roc_auc = roc_auc_score(y_test, y_proba)
 
# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss:.4f}")
print(f"AUC: {roc_auc:.4f}")


Output:

0: learn: 0.6333595 test: 0.6326569 best: 0.6326569 (0) total: 5.68ms remaining: 563ms
10: learn: 0.2973689 test: 0.2938201 best: 0.2938201 (10) total: 9.87ms remaining: 79.9ms
20: learn: 0.1637735 test: 0.1591490 best: 0.1591490 (20) total: 13.7ms remaining: 51.5ms
30: learn: 0.1051307 test: 0.1011177 best: 0.1011177 (30) total: 17.7ms remaining: 39.5ms
40: learn: 0.0715529 test: 0.0695287 best: 0.0695287 (40) total: 21.5ms remaining: 31ms
50: learn: 0.0533052 test: 0.0515575 best: 0.0515575 (50) total: 25.1ms remaining: 24.1ms
60: learn: 0.0416665 test: 0.0404120 best: 0.0404120 (60) total: 28.6ms remaining: 18.3ms
70: learn: 0.0342899 test: 0.0332187 best: 0.0332187 (70) total: 33.8ms remaining: 13.8ms
80: learn: 0.0294652 test: 0.0286255 best: 0.0286255 (80) total: 37.4ms remaining: 8.78ms
90: learn: 0.0256959 test: 0.0250120 best: 0.0250120 (90) total: 41.2ms remaining: 4.07ms
99: learn: 0.0230690 test: 0.0225294 best: 0.0225294 (99) total: 45.1ms remaining: 0us
bestTest = 0.02252943945
bestIteration = 99
Accuracy: 1.0000
Log Loss: 0.0225
AUC: 1.0000

This code trains a CatBoost classifier with the parameters defined above and evaluates it on the test set. It computes three metrics from the test predictions: accuracy (the fraction of correct labels), log loss (the quality of the predicted probabilities), and area under the ROC curve, or AUC (how well the model ranks positive samples above negative ones). The perfect accuracy and AUC here are expected, since setosa is easily separated from the other two species.
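The output above shows a log loss of 0.0225 even though accuracy and AUC are both 1.0. The following sketch uses a small set of hypothetical labels and probabilities (not values from the model above) to show why: accuracy and AUC depend only on thresholded labels and ranking, while log loss penalizes every probability that is not exactly 0 or 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([1, 0, 0, 1, 0])
p_pos = np.array([0.9, 0.2, 0.1, 0.6, 0.4])

# Binary log loss computed directly from its definition
manual_logloss = -np.mean(
    y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)
)

# Accuracy uses hard labels obtained with a 0.5 threshold
y_hat = (p_pos >= 0.5).astype(int)
acc = accuracy_score(y_true, y_hat)

# AUC depends only on how the probabilities rank positives against negatives
auc = roc_auc_score(y_true, p_pos)

print(f"Manual log loss:  {manual_logloss:.4f}")
print(f"sklearn log loss: {log_loss(y_true, p_pos):.4f}")
print(f"Accuracy: {acc:.4f}")
print(f"AUC: {auc:.4f}")
```

Every prediction here falls on the correct side of 0.5 and every positive is ranked above every negative, so accuracy and AUC are perfect, yet the log loss stays above zero because the probabilities are not fully confident.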

Classification Report

# Generate a classification report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)


Output:

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

This code compares the predicted labels (y_pred) with the actual labels (y_test) and prints a classification report. The report lists precision, recall, F1-score, and support (the number of test samples) for each class, along with overall accuracy and the macro and weighted averages.
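Because every score in the report above is 1.00, it says little about how the metrics differ. The sketch below uses hypothetical labels containing one mistake, so precision and recall diverge, and requests the report as a dictionary through the output_dict argument so individual values can be read programmatically. The class names passed to target_names are labels chosen for this example.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: one "other" sample (index 2) is wrongly predicted as setosa
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0]

report = classification_report(
    y_true,
    y_pred,
    target_names=["other", "setosa"],  # display names for classes 0 and 1
    output_dict=True,                  # return a nested dict instead of a string
)

setosa = report["setosa"]
print(f"setosa precision: {setosa['precision']:.2f}")
print(f"setosa recall:    {setosa['recall']:.2f}")
print(f"overall accuracy: {report['accuracy']:.2f}")
```

Here all true setosa samples are found, so recall is perfect, but one of the three setosa predictions is wrong, which lowers precision.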
