Implementation of CatBoost

Let’s implement CatBoost in Python.

Importing Libraries

Python3

# Importing necessary libraries
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

CatBoostClassifier: The classifier class from the CatBoost library.
train_test_split: A Scikit-Learn function that splits the dataset into training and testing sets.
load_iris: Loads the Iris dataset from Scikit-Learn, a classic machine learning dataset containing measurements for 150 iris flowers from three species.
accuracy_score: A Scikit-Learn function that computes the fraction of predictions that match the true labels.
classification_report: A Scikit-Learn function that summarizes precision, recall, and F1-score for each class; it is used in the Classification Report step.

Dataset Loading and Splitting

Python3

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
 
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

load_iris(): Loads the Iris dataset. iris.data contains the features (sepal length, sepal width, petal length, and petal width), and iris.target contains the corresponding labels (species: Setosa, Versicolor, or Virginica). train_test_split then divides the data so that 80% is used for training and 20% for testing (120 and 30 samples, respectively). Fixing random_state makes the split reproducible.
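To make the 80/20 split concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-split. It only illustrates the idea and is not Scikit-Learn's implementation; the helper name simple_split is hypothetical, and the rows it selects will differ from those returned by train_test_split.

```python
import random


def simple_split(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a seeded shuffle-and-split.

    Hypothetical helper for illustration only; not scikit-learn's algorithm.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # same seed -> same order on every run
    n_test = round(n_samples * test_size)  # 150 * 0.2 -> 30 test samples
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(150)
print(len(train_idx), len(test_idx))       # 120 30
```

Calling simple_split(150) twice returns the same indices, which is the property that a fixed random_state provides.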

Creating CatBoostClassifier Instance

Python3

# Create CatBoostClassifier instance
catboost_model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.1, loss_function='MultiClass',
                                    custom_metric='Accuracy', random_seed=42, verbose=200)

We create a CatBoostClassifier instance. Various hyperparameters are set, including:

iterations: The maximum number of boosting iterations (trees) to build.
depth: The depth of each tree in the model.
learning_rate: The shrinkage factor applied to each tree’s contribution; smaller values make each update more conservative, which helps limit overfitting.
loss_function: The loss function used for training (in this case, ‘MultiClass’ for multi-class classification).
custom_metric: An additional metric computed and recorded during training (‘Accuracy’ in this case); it is reported for information only and does not change the objective being optimized.
random_seed: Seed for random number generation to make the results reproducible.
verbose: Controls logging during training; with an integer value, progress is printed once every that many iterations (here, every 200 iterations).
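As a small illustration of how an integer verbose value relates to the training log, the sketch below lists the iterations at which a line would be printed. It assumes that every 200th iteration plus the final one is logged, which matches the training output shown in this article; the helper name logged_iterations is hypothetical.

```python
def logged_iterations(iterations, period):
    """Iterations at which a log line would appear for a given logging period.

    Assumption: every `period`-th iteration (starting at 0) and the final
    iteration are logged, matching the output shown in this article.
    """
    last = iterations - 1
    return [i for i in range(iterations) if i % period == 0 or i == last]


print(logged_iterations(500, 200))  # [0, 200, 400, 499]
```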

Training the Model

Python3

# Training the model
catboost_model.fit(X_train, y_train, eval_set=(X_test, y_test))

Output:

0:    learn: 0.9959553    test: 0.9895085    best: 0.9895085 (0)    total: 773us    remaining: 386ms
200:    learn: 0.0198651    test: 0.0157271    best: 0.0157271 (200)    total: 54.1ms    remaining: 80.4ms
400:    learn: 0.0089282    test: 0.0078847    best: 0.0078847 (400)    total: 99.7ms    remaining: 24.6ms
499:    learn: 0.0069487    test: 0.0062775    best: 0.0062775 (499)    total: 122ms    remaining: 0us

bestTest = 0.00627745227
bestIteration = 499
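The learn and test values in this log are loss values, not accuracies. The sketch below assumes that the ‘MultiClass’ loss is reported as the mean negative log-probability of the true class under a softmax over the raw class scores; the function names are illustrative and this is not CatBoost's internal code.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_log_loss(raw_scores, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(softmax(row)[label])
              for row, label in zip(raw_scores, labels)]
    return sum(losses) / len(losses)


# A model that scores all three classes equally (no information yet)
uniform = [[0.0, 0.0, 0.0]] * 4
print(round(multiclass_log_loss(uniform, [0, 1, 2, 0]), 4))  # 1.0986, i.e. ln(3)
```

Under that assumption, a three-class model with no information starts near ln(3) ≈ 1.0986, so the value of about 0.996 after the first iteration and the steady decrease afterwards are what one would expect.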

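The model is trained on the training data (X_train, y_train). The eval_set parameter specifies an evaluation dataset, here (X_test, y_test), so that performance on held-out data can be monitored during training. In the log, the learn and test columns report the value of the loss function on the training and evaluation sets, best is the lowest evaluation loss so far together with the iteration at which it occurred, and total and remaining are timings. Because the test set doubles as the evaluation set here, the final test metrics are not a fully independent estimate; a separate validation split avoids this.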
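Monitoring an evaluation set is also the basis for stopping training early. CatBoost has built-in overfitting detection (for example, the early_stopping_rounds option); the sketch below is not that implementation, only a plain-Python illustration of a patience-based rule applied to a made-up validation curve. The function name best_iteration is hypothetical.

```python
def best_iteration(validation_losses, patience):
    """Return (best_index, stop_index) for a patience-based early-stopping rule.

    Training is considered stopped once the validation loss has failed to
    improve for `patience` consecutive iterations.
    """
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(validation_losses) - 1


# Made-up validation curve: improves, then starts to overfit
curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.43, 0.47, 0.50]
print(best_iteration(curve, patience=3))  # (3, 6)
```

In the training run shown above, the best iteration is the last one (499), so a rule like this would not have stopped training early.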

Predictions and Evaluation

The trained model is then used to make predictions on the test data (X_test), and the accuracy of the model is calculated using accuracy_score().

Python3

# Making predictions
predictions = catboost_model.predict(X_test)
 
# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Output:

Accuracy: 100.00%

Accuracy is the proportion of class labels that are predicted correctly. Here it is 100%, meaning all 30 test samples were classified correctly.
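The definition can be written out in a few lines. This is a minimal sketch of the same calculation that accuracy_score performs for simple label lists, using small made-up labels rather than the Iris predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Four of the five made-up predictions are correct
print(accuracy([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))  # 0.8
```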

Classification Report

Python3

# Generate and print the classification report
class_report = classification_report(y_test, predictions)
print("Classification Report:\n", class_report)

Output:

Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
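The per-class columns of the report can be reproduced by counting true positives, false positives, and false negatives for each label. The following is a minimal sketch on made-up labels; the helper name per_class_report is hypothetical, and it omits the averaged rows that classification_report also prints.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for each class label."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Made-up labels: one sample of class 1 is misclassified as class 2
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]
for label, row in per_class_report(y_true_demo, y_pred_demo).items():
    print(label, {k: round(v, 2) for k, v in row.items()})
```

When every prediction is correct, as in the Iris run above, all three scores are 1.00 for every class and the support column simply counts the test samples per class.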

Optimizing CatBoost

Although CatBoost has strong default settings, tuning a few key parameters can further improve model performance. The learning rate (‘learning_rate’, also accepted as ‘eta’) sets the step size of each boosting update. Higher learning rates speed up learning at the risk of overshooting the optimal solution, while lower learning rates are more stable but usually require more iterations. Balancing the two is a central part of tuning.

The ‘depth’ parameter determines the tree depth, which directly affects model complexity. Deeper trees can capture more detailed patterns but are more prone to overfitting, while shallower trees reduce overfitting but may miss complex relationships. The ideal depth balances pattern capture against generalization.
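The trade-off between learning rate and iteration count can be seen in a deliberately idealized toy model. The sketch assumes that every weak learner predicts the current residual exactly, so each boosting step removes a learning_rate fraction of the remaining error; real trees are only approximate, and remaining_error is a hypothetical helper, not part of CatBoost.

```python
def remaining_error(learning_rate, iterations):
    """Fraction of the initial residual left after boosting.

    Idealized assumption: every weak learner predicts the current residual
    exactly, so each step removes `learning_rate` of what is left.
    """
    residual = 1.0
    for _ in range(iterations):
        residual -= learning_rate * residual
    return residual


# Smaller rates leave more error after the same number of iterations
for lr in (0.03, 0.1, 0.5):
    print(lr, remaining_error(lr, 50))

# A rate above 1 overshoots: the residual changes sign after one step
print(remaining_error(1.5, 1))  # -0.5
```

In this toy model the error shrinks by a factor of (1 - learning_rate) per step, which is why a low rate needs many more iterations to reach the same fit.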

The ‘iterations’ parameter sets the number of boosting iterations, which strongly influences how much the model can learn. More iterations let the model fit the data more closely, but too many can cause overfitting. Monitoring performance on a validation set is a common way to choose the iteration count.

In practice, grid search and random search are used to explore these values during CatBoost hyperparameter tuning. Through this iterative process, data scientists adjust the balance between model complexity and generalization to improve predictive performance on a particular task.
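A grid search over these three parameters can be sketched as a loop over every combination. The version below is library-agnostic: score_fn is a caller-supplied function, and toy_score is a placeholder invented purely so the sketch runs on its own. In practice, score_fn would fit a CatBoostClassifier with the given parameters and return a score measured on a validation set.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score).

    `score_fn` is a caller-supplied function that evaluates one parameter
    combination and returns a validation score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {
    "learning_rate": [0.03, 0.1, 0.3],
    "depth": [4, 6, 8],
    "iterations": [200, 500],
}


def toy_score(params):
    """Placeholder scorer (not a real model): peaks at learning_rate=0.1, depth=6."""
    return -abs(params["learning_rate"] - 0.1) - abs(params["depth"] - 6)


best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Random search follows the same pattern but samples a fixed number of combinations instead of enumerating all 18.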

CatBoost Optimization Technique

CatBoost is a high-performance, open-source library for gradient boosting on decision trees, developed by Yandex, a Russian multinational IT company. It has become a popular choice among data scientists and machine learning practitioners, and this article explores how it works and why it is widely used.
