CatBoost Regression Metrics

CatBoost is a powerful gradient boosting library that has gained popularity in recent years due to its ease of use, efficiency, and high performance. One of the key aspects of using CatBoost is understanding the various metrics it provides for evaluating the performance of regression models.

In this article, we will delve into the world of CatBoost regression metrics, exploring what they are, how they work, and how to interpret them with practical examples.

Table of Contents

  • Understanding Regression Metrics
  • Common CatBoost Regression Metrics
  • Utilizing CatBoost Regression Metrics
  • Choosing the Right CatBoost Regression Metric

Understanding Regression Metrics

Regression metrics are used to measure the performance of a model in predicting continuous outcomes. In CatBoost, these metrics are essential for evaluating the accuracy and reliability of regression models. The choice of metric depends on the specific problem and the type of data being used.

Common CatBoost Regression Metrics

1. Mean Squared Error (MSE)

In simple terms, MSE measures how far the predicted values are from the actual values by averaging the squared differences over the entire dataset. As with any error metric, a lower MSE is preferable because it indicates that the model's predictions are closer to the true values. It also reflects how well the regression line fits the data.

[Tex]\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 [/Tex]

Where:

  • n is the number of observations.
  • yᵢ is the actual value.
  • ŷᵢ is the predicted value.
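To make the formula concrete, here is a minimal sketch (with made-up `y_true` and `y_pred` values) that computes MSE directly from the definition and checks it against scikit-learn's `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# MSE straight from the formula: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

print(mse_manual)                          # 0.375
print(mean_squared_error(y_true, y_pred))  # 0.375
```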

2. Mean Absolute Error (MAE)

Another significant measure is the Mean Absolute Error (MAE), which averages the absolute differences between the predicted and actual values. MAE is less affected by outliers than MSE, making it useful when the evaluation should not be dominated by a few extreme errors. MAE is calculated as follows:

[Tex]\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|[/Tex]

where,

  • n is the number of observations,
  • yᵢ is the actual value for observation i,
  • ŷᵢ is the predicted value for observation i.
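The same made-up numbers can be used to compute MAE by hand and verify it against scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# MAE straight from the formula: average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

print(mae_manual)                           # 0.5
print(mean_absolute_error(y_true, y_pred))  # 0.5
```

Note that the single large error (7.0 vs 8.0) pulls MSE up more than MAE, which is exactly the outlier sensitivity described above.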

3. Root Mean Squared Error (RMSE)

Root Mean Squared Error (RMSE) is simply the square root of the Mean Squared Error: the squared differences between predicted and actual values are averaged, and the square root of that average is taken. As mentioned earlier, RMSE is often more interpretable than MSE because it is expressed in the same units as the target variable. The formula for RMSE is:

[Tex]\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} [/Tex]

where

  • n is the number of observations,
  • yᵢ is the actual value for observation i,
  • ŷᵢ is the predicted value for observation i.
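Since RMSE is just the square root of MSE, it can be computed with `math.sqrt` on top of scikit-learn's `mean_squared_error`; a minimal sketch with made-up numbers:

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# RMSE is the square root of MSE, so it is in the target's own units
rmse = math.sqrt(mean_squared_error(y_true, y_pred))
print(f"{rmse:.4f}")  # 0.6124
```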

4. R-Squared (R²)

R-Squared (R²) assesses the proportion of variability in the target variable that is explained by the model, with a value of 1 indicating a perfect fit. A related measure, adjusted R-Squared, additionally penalizes the number of predictors used in the model.

[Tex]R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}[/Tex]

This formula expresses R-squared as 1 minus the ratio of the sum of squared residuals to the total sum of squares, where:

  • n is the number of observations,
  • yᵢ is the actual value for observation i,
  • ŷᵢ is the predicted value for observation i,
  • ȳ is the mean of the actual values.
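To tie the formula to code, here is a small sketch (again with made-up numbers) that computes R² from the residual and total sums of squares and compares it with scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"{r2_manual:.4f}")                 # 0.8818
print(f"{r2_score(y_true, y_pred):.4f}")  # 0.8818
```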

5. Explained Variance Score (EVS)

The Explained Variance Score (EVS) is a measure commonly used to evaluate regression models, estimating the extent to which the model's predictions explain the variance of the target variable. Unlike R², it is based on the variance of the residuals rather than the sum of squared residuals, so a constant bias in the predictions does not lower the score. It is expressed as follows:

[Tex]\text{EVS} = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}[/Tex]

where:

  • y is the vector of true values,
  • ŷ is the vector of predicted values,
  • Var denotes the variance.
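As a quick illustration of how EVS and R² can disagree, scikit-learn's `explained_variance_score` is based on the variance of the residuals, so a constant offset in the predictions leaves it unchanged, while `r2_score` is penalized by the same bias; the numbers below are made up:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Made-up values where every prediction is off by a constant +1
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0

evs = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The residuals all equal 1, so their variance is zero and EVS is perfect,
# while R² is reduced by the systematic bias
print(evs)           # 1.0
print(round(r2, 4))  # 0.2
```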

Utilizing CatBoost Regression Metrics

When interpreting CatBoost regression metrics, it’s essential to consider the context of the problem and the type of data being used. Here are some general guidelines:

  • Lower is better: For metrics like MSE, MAE, and RMSE, a lower value indicates better model performance.
  • Higher is better: For metrics like R-Squared, a higher value indicates better model performance.
  • Context matters: The choice of metric and the interpretation of results depend on the specific problem and data.

Let's walk through an example of computing CatBoost regression metrics on the Iris dataset, treating its numeric target as a continuous variable for illustration.

Implement the CatBoost Algorithm

Python

import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

# Load the Iris dataset and use its numeric target as a regression target
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a CatBoost regressor with RMSE as the loss function
model = CatBoostRegressor(iterations=100, learning_rate=0.1, loss_function='RMSE', verbose=False)
model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = model.predict(X_test)

Calculate CatBoost Regression Metrics

Python

# Calculate regression metrics on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
explained_variance = explained_variance_score(y_test, y_pred)
rmse = math.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R^2): {r2:.4f}")
print(f"Explained Variance Score: {explained_variance:.4f}")

Output:

Mean Squared Error (MSE): 0.0067
Root Mean Squared Error (RMSE): 0.0817
R-squared (R^2): 0.9904
Explained Variance Score: 0.9906

Choosing the Right CatBoost Regression Metric

  • Prioritize interpretability: If you need to easily explain your model’s performance to stakeholders, MAE or RMSE are often preferable. They directly relate to the units of your target variable. RMSE is suitable when large errors are particularly undesirable.
  • Outliers are a concern: If your dataset has outliers that you don’t want to overly influence your model evaluation, MAE is a good choice. It treats all errors equally.
  • Sensitivity to large errors is important: If it’s critical to capture and penalize large prediction errors, MSE or RMSE are more suitable.
  • Model fit assessment: R² or EVS provide a good overview of how well your model captures the overall variance in the target variable. R^2 is useful for understanding the proportion of variance explained by the model but should be used alongside other metrics.

Conclusion

In conclusion, CatBoost is an effective library for regression analysis, and metrics such as MSE, MAE, RMSE, R-Squared, and the Explained Variance Score can be tracked during training and evaluation. By choosing metrics appropriate to the problem at hand, data scientists can build reliable regression models that produce sound forecasts.