Multiregression using CatBoost

Multiregression, also known as multiple regression, is a statistical method used to predict a target variable based on two or more predictor variables. This technique is widely used in various fields such as finance, economics, marketing, and machine learning. CatBoost, a powerful gradient boosting library, provides efficient and robust algorithms for multiregression tasks. In this article, we will explore how to leverage CatBoost for multiregression and achieve accurate predictions.

Table of Contents

  • Understanding Multiregression
  • What is CatBoost?
  • Implementing Multiregression with CatBoost
  • Pros & Cons of Using CatBoost for Multiregression
  • Conclusion

Implementing Multiregression with CatBoost

Let’s dive into a practical example of using CatBoost for multiregression:

Install CatBoost

Ensure you have CatBoost installed in your Python environment. You can install it via pip:

pip install catboost

Step 1: Loading a Public Dataset

We’ll use a publicly accessible online dataset (the Boston Housing data) for this example and load it directly from its URL.

Python

import pandas as pd

# Load dataset
url = 'https://media.w3wiki.net/wp-content/uploads/20240527142547/BostonHousing.csv'
df = pd.read_csv(url)
print(df.head())

Output:

crim zn indus chas nox rm age dis rad tax ptratio \
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7

b lstat medv
0 396.90 4.98 24.0
1 396.90 9.14 21.6
2 392.83 4.03 34.7
3 394.63 2.94 33.4
4 396.90 5.33 36.2

Step 2: Preprocessing Data

Before preparing the data for modeling, we first look at the distribution of the target variable, medv.

Python

import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the distribution of the target variable
sns.histplot(df['medv'], bins=30, kde=True)
plt.title('Distribution of MEDV (Median Value of Homes)')
plt.savefig('Distribution.webp')
plt.show()

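Output:

[Figure: histogram of medv with a KDE curve, titled "Distribution of MEDV (Median Value of Homes)"]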

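Next, the data must be prepared for the model. In general this can involve handling missing values, standardizing features, and encoding categorical features; here we separate the features from the target, split the data into training and test sets, and standardize the features. Tree-based models such as CatBoost do not strictly need standardized inputs, so the scaling step is optional here.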

Python

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into features and target
X = df.drop('medv', axis=1)
y = df['medv']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the feature data (fit on the training set only, then apply to both)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
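
To make the scaling step concrete, the sketch below reproduces what StandardScaler computes using only NumPy. The arrays are small made-up values (not taken from the housing data), and the names X_train_toy and X_test_toy are hypothetical. The point it illustrates is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# Toy feature matrices (illustrative values, not from the housing dataset)
X_train_toy = np.array([[1.0, 200.0],
                        [2.0, 300.0],
                        [3.0, 400.0]])
X_test_toy = np.array([[2.0, 500.0]])

# Statistics are computed on the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# The same training statistics are applied to both sets
X_train_toy_scaled = (X_train_toy - mean) / std
X_test_toy_scaled = (X_test_toy - mean) / std

print(X_train_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_test_toy_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, which is why the article calls fit_transform on the training data but only transform on the test data.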

Step 3: Train the Model

Now, we will define and train our CatBoost regressor model.

Python

from catboost import CatBoostRegressor

# Initialize the CatBoostRegressor
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=3,
    loss_function='RMSE',
    verbose=200)

# Fit the model
model.fit(X_train_scaled, y_train)

Output:

0: learn: 9.0223472 total: 138ms remaining: 2m 18s
200: learn: 2.4369710 total: 252ms remaining: 1s
400: learn: 1.8078506 total: 365ms remaining: 545ms
600: learn: 1.4641839 total: 475ms remaining: 315ms
800: learn: 1.2249782 total: 587ms remaining: 146ms
999: learn: 1.0551550 total: 696ms remaining: 0us
<catboost.core.CatBoostRegressor at 0x193071691d0>

Step 4: Making Predictions and Evaluating the Model

After training, we make predictions on the test set and evaluate our model using RMSE.

Python

from sklearn.metrics import mean_squared_error

# Make predictions
predictions = model.predict(X_test_scaled)

# Calculate RMSE as the square root of the MSE
# (the squared=False argument is deprecated and removed in recent scikit-learn releases)
rmse = mean_squared_error(y_test, predictions) ** 0.5
print(f'Root Mean Squared Error: {rmse}')

Output:

Root Mean Squared Error: 2.9516912601424115
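
RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the target. The short sketch below works through that arithmetic with NumPy on four made-up value pairs (illustrative numbers only, not model output).

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true_toy = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_toy = np.array([25.0, 20.6, 32.7, 35.4])

errors = y_true_toy - y_pred_toy   # roughly [-1, 1, 2, -2]
mse = np.mean(errors ** 2)         # (1 + 1 + 4 + 4) / 4 = 2.5
rmse_toy = np.sqrt(mse)            # square root of the MSE

print(f'MSE: {mse:.4f}, RMSE: {rmse_toy:.4f}')
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.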

Step 5: Visualizing the Results

Finally, to assess the model visually, we plot the predicted values against the actual values. Points that lie close to the red diagonal line correspond to accurate predictions.

Python

# Visualize the actual vs predicted values
plt.scatter(y_test, predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Diagonal line
plt.show()

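Output:

[Figure: scatter plot of predicted versus actual values with a red diagonal reference line, titled "Actual vs Predicted Values"]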

Together, these steps show how to use CatBoost for multiregression, from data preparation through model training to result visualization. Practice and experimentation are key to mastering machine learning, so feel free to try other datasets and parameter settings and observe how the model performs.

Understanding Multiregression

Multiregression extends the concept of simple linear regression by allowing multiple independent variables to be used in predicting a dependent variable. The relationship between the predictor variables and the target variable is expressed through a linear equation:

[Tex]Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon[/Tex]

Where:

  • Y is the dependent variable (target).
  • [Tex]X_1, X_2, \ldots, X_n[/Tex] are the independent variables (predictors).
  • [Tex]\beta_0, \beta_1, \ldots, \beta_n[/Tex] are the coefficients representing the strength and direction of the relationship between the predictors and the target.
  • [Tex]\epsilon[/Tex] is the error term.
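
As a worked illustration of the equation above, the sketch below generates synthetic data from known coefficients and recovers them by ordinary least squares with NumPy. The coefficient values (1.0, 2.0, -3.0) and the names X_lin, y_lin, and beta_hat are assumptions chosen for the example.

```python
import numpy as np

# Synthetic data generated from known coefficients (illustrative assumption):
# Y = 1.0 + 2.0*X1 - 3.0*X2 + noise
rng = np.random.default_rng(42)
X_lin = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
y_lin = 1.0 + 2.0 * X_lin[:, 0] - 3.0 * X_lin[:, 1] + noise

# Prepend a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(len(X_lin)), X_lin])

# Ordinary least squares estimate of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X_design, y_lin, rcond=None)
print(beta_hat)  # close to [1.0, 2.0, -3.0]
```

The estimates differ slightly from the true coefficients because of the error term. Note that CatBoost does not fit this linear form: it builds an ensemble of decision trees, so it can also capture non-linear relationships between predictors and target.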

What is CatBoost?

CatBoost, short for Categorical Boosting, is an open-source gradient boosting library developed by Yandex. Gradient boosting is a machine learning technique well suited to regression problems, and CatBoost is known for being fast and effective. It works with many kinds of data and is particularly strong on datasets with categorical features (such as colors or types). Among CatBoost’s noteworthy attributes are:

  • Support for Categorical Data: CatBoost can handle categorical features directly, without requiring explicit encoding (such as one-hot encoding) beforehand.
  • Fast Training and Prediction: CatBoost is optimized for speed, making it suitable for large datasets.
  • Excellent Performance: In terms of accuracy and generalization, it is often competitive with, and on some datasets better than, other gradient boosting libraries such as XGBoost and LightGBM.

Conclusion

Multiregression is a powerful technique for predicting a target variable based on multiple predictor variables. With the advent of advanced machine learning libraries like CatBoost, performing multiregression tasks has become more accessible and efficient. By following the steps outlined in this article, you can leverage CatBoost to build accurate multiregression models for a wide range of applications. Experiment with different parameters and features to fine-tune your models and achieve optimal performance.