Implementation of CatBoost
Let’s implement CatBoost in Python.
Importing Libraries
Python3
# Importing necessary libraries
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
- CatBoostClassifier: The classifier class from the CatBoost library.
- train_test_split: From Scikit-Learn, this function splits the dataset into training and testing sets.
- load_iris: Loads the Iris dataset from Scikit-Learn. The Iris dataset is a classic machine learning dataset containing measurements of 150 iris flowers from three species.
- accuracy_score: From Scikit-Learn, this function computes classification accuracy, the fraction of predictions that match the true labels.
Dataset Loading and Splitting
Python3
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
load_iris() loads the Iris dataset. iris.data contains the features (sepal length, sepal width, petal length, and petal width), and iris.target contains the corresponding labels (species: Setosa, Versicolor, or Virginica). train_test_split then splits the data, with 80% used for training and 20% for testing. Setting random_state fixes the shuffle so the split is reproducible.
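As a quick sanity check on the split described above, the sketch below prints the resulting array shapes and the number of test samples per class. The `stratify=y` variant at the end is an assumption added for illustration and is not part of the original walkthrough; it keeps the class proportions equal in both splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 150 samples with 4 features each: 120 go to training, 30 to testing
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

# Number of test samples per class (array index = class label)
test_class_counts = np.bincount(y_test, minlength=3)
print("Test samples per class:", test_class_counts)

# Optional variant (an assumption, not used in the article):
# stratify=y preserves the class proportions in both splits
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
strat_class_counts = np.bincount(y_test_strat, minlength=3)
print("Stratified test samples per class:", strat_class_counts)
```

Stratification matters little on a balanced dataset such as Iris, but it can be useful when classes are imbalanced.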
Creating CatBoostClassifier Instance
Python3
# Create CatBoostClassifier instance
catboost_model = CatBoostClassifier(
    iterations=500,
    depth=6,
    learning_rate=0.1,
    loss_function='MultiClass',
    custom_metric='Accuracy',
    random_seed=42,
    verbose=200
)
We create a CatBoostClassifier instance. Various hyperparameters are set, including:
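- iterations: The maximum number of boosting iterations, that is, the maximum number of trees built.
- depth: The depth of each tree in the model.
- learning_rate: The shrinkage factor applied to each tree's contribution. Smaller values make learning more gradual and can reduce overfitting, but usually require more iterations.
- loss_function: The loss function optimized during training (in this case, ‘MultiClass’ for multi-class classification).
- custom_metric: An additional metric that is computed and logged during training (‘Accuracy’ in this case). It is reported for information only and is not optimized; the metric used for overfitting detection and best-model selection is set separately with eval_metric.
- random_seed: Seed for random number generation to make the results reproducible.
- verbose: Controls logging during training. With an integer value N, metrics are printed every N iterations, so verbose = 200 logs every 200th iteration plus the final one.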
Training the Model
Python3
# Training the model
catboost_model.fit(X_train, y_train, eval_set=(X_test, y_test))
Output:
0: learn: 0.9959553 test: 0.9895085 best: 0.9895085 (0) total: 773us remaining: 386ms
200: learn: 0.0198651 test: 0.0157271 best: 0.0157271 (200) total: 54.1ms remaining: 80.4ms
400: learn: 0.0089282 test: 0.0078847 best: 0.0078847 (400) total: 99.7ms remaining: 24.6ms
499: learn: 0.0069487 test: 0.0062775 best: 0.0062775 (499) total: 122ms remaining: 0us
bestTest = 0.00627745227
bestIteration = 499
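The model is trained on the training data (X_train, y_train). The eval_set parameter specifies an evaluation dataset, here (X_test, y_test), on which the model is scored during training. The learn and test columns in the log are values of the MultiClass loss, so lower is better; best and bestIteration report the lowest test loss and the iteration at which it occurred.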
Predictions and Evaluation
The trained model is then used to make predictions on the test data (X_test), and the accuracy of the model is calculated using accuracy_score().
Python3
# Making predictions
predictions = catboost_model.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Output:
Accuracy: 100.00%
Accuracy is the proportion of correctly predicted class labels. In this case it is 100%, meaning all 30 test samples were classified correctly.
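The accuracy computation itself is simple enough to sketch by hand. The labels below are hypothetical and are not the Iris results. The predictions are stored as a column vector on the assumption that, for a MultiClass model, CatBoost's predict can return an array of shape (n_samples, 1); flattening it first avoids an unintended broadcast in the comparison.

```python
import numpy as np

# Hypothetical labels for illustration (not the Iris results)
y_true = np.array([0, 1, 2, 2, 1, 0])

# Column vector, mimicking a (n_samples, 1) prediction array
y_pred = np.array([[0], [1], [2], [1], [1], [0]])

# Flatten to shape (n_samples,) so the comparison is element by element
y_pred_flat = y_pred.ravel()

# Accuracy = number of correct predictions / total number of predictions
accuracy = np.mean(y_pred_flat == y_true)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```

Here five of the six predictions match, so the result is 5/6.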
Classification Report
Python3
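# classification_report was not imported earlier, so import it here
from sklearn.metrics import classification_report

# Generate and print the classification report
class_report = classification_report(y_test, predictions)
print("Classification Report:\n", class_report)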
Output:
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
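To show where the columns of such a report come from, the sketch below derives per-class precision, recall, F1-score, and support from a confusion matrix. The labels are hypothetical and chosen so that the scores are not all 1.00; they are not the Iris predictions.

```python
import numpy as np

# Hypothetical labels for illustration (not the Iris results)
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 0])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tp = np.diag(cm)                 # correct predictions per class
precision = tp / cm.sum(axis=0)  # correct / all samples predicted as the class
recall = tp / cm.sum(axis=1)     # correct / all samples truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)         # number of true samples per class

for c in range(n_classes):
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f} support={}".format(
        c, precision[c], recall[c], f1[c], support[c]))
```

This sketch assumes every class is predicted at least once; otherwise the precision for that class would involve a division by zero.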
Optimizing CatBoost
Although CatBoost has strong default settings, it can be tuned further by adjusting a few important parameters. The learning rate (‘learning_rate’, also accepted as ‘eta’) sets the step size of each boosting update. Higher learning rates speed up learning at the risk of overshooting the optimum, while lower learning rates are more stable but may require more iterations. Balancing this parameter is a central part of fine-tuning.
The ‘depth’ parameter sets the tree depth, which directly affects model complexity. Shallower trees are less prone to overfitting but may miss complicated relationships, while deeper trees capture more detailed patterns at a higher risk of overfitting. Choosing the depth is therefore a trade-off between pattern capture and generalization.
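The learning-rate trade-off can be illustrated with a deliberately simplified toy model, not CatBoost itself. It assumes squared-error loss and an idealized weak learner that predicts the current residual exactly, so each boosting step removes a learning_rate fraction of the remaining error. The function name `residual_after` is hypothetical.

```python
def residual_after(learning_rate, n_steps, initial_residual=1.0):
    """Residual left after n_steps of idealized boosting on one sample."""
    residual = initial_residual
    for _ in range(n_steps):
        # The idealized learner predicts the residual exactly;
        # only a learning_rate fraction of that prediction is added
        residual -= learning_rate * residual
    return residual


for lr in (0.1, 0.5, 1.5, 2.5):
    print("learning_rate={}: residual after 10 steps = {:.6f}".format(
        lr, residual_after(lr, 10)))
```

In this toy setting a small rate shrinks the error slowly, a moderate rate shrinks it quickly, a rate above 1 overshoots the target at every step, and a rate above 2 makes the error grow. Real trees fit the residual only approximately, so the behaviour in practice is less clear-cut.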
The ‘iterations’ parameter sets the number of boosting iterations and strongly influences how much the model can learn. More iterations let the model fit the training data more closely, but too many can cause overfitting. Monitoring performance on a validation set is a common way to choose the iteration count.
In practice, grid search and random search are used to try different combinations of these values during CatBoost hyperparameter tuning. Repeating this process lets data scientists balance model complexity against generalization and improve predictive performance for a particular machine learning task.
CatBoost Optimization Technique
CatBoost is a high-performance, open-source library for gradient boosting on decision trees, developed by Yandex, a Russian multinational IT company. This article explores how CatBoost works and why it has become a popular choice among data scientists and machine learning practitioners.