Implementation of Regression Using CatBoost

We will use a house rent dataset to perform a regression task with the CatBoost algorithm. Before we can use the model, we first have to install the catboost package with the command below:

Installing Packages

!pip install catboost

Importing Libraries and Dataset

Python libraries make it easy to handle the data and to perform both routine and complex tasks in a few lines of code.

  • Pandas – Loads the data into a 2D tabular structure (a DataFrame) and provides many functions for analysing it.
  • NumPy – Provides fast array operations, so large numerical computations run in a short time.
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.
  • Sklearn – Provides modules with pre-implemented functions for tasks ranging from data preprocessing to model development and evaluation.

Python3




# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')


Loading Dataset and Retrieving Information

Python3




#loading dataset
df = pd.read_csv('House_Rent_Dataset.csv')
print(df.head())


Output:

    Posted On  BHK   Rent  Size            Floor    Area Type  \
0  2022-05-18    2  10000  1100  Ground out of 2   Super Area
1  2022-05-13    2  20000   800       1 out of 3   Super Area
2  2022-05-16    2  17000  1000       1 out of 3   Super Area
3  2022-07-04    2  10000   800       1 out of 2   Super Area
4  2022-05-09    2   7500   850       1 out of 2  Carpet Area

              Area Locality     City  Furnishing Status  Tenant Preferred  \
0                    Bandel  Kolkata        Unfurnished  Bachelors/Family
1  Phool Bagan, Kankurgachi  Kolkata     Semi-Furnished  Bachelors/Family
2   Salt Lake City Sector 2  Kolkata     Semi-Furnished  Bachelors/Family
3               Dumdum Park  Kolkata        Unfurnished  Bachelors/Family
4             South Dum Dum  Kolkata        Unfurnished         Bachelors

   Bathroom Point of Contact
0         2    Contact Owner
1         1    Contact Owner
2         1    Contact Owner
3         1    Contact Owner
4         1    Contact Owner

Here, we load the dataset and print its first five rows.
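By default, read_csv loads a date column such as ‘Posted On’ as plain text. The sketch below shows one possible way to convert it while loading, using the parse_dates argument. It assumes a tiny in-memory sample in place of House_Rent_Dataset.csv: the three rows are copied from the output above, and the names sample_csv and sample are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for House_Rent_Dataset.csv (3 rows, 4 columns)
sample_csv = io.StringIO(
    "Posted On,BHK,Rent,Size\n"
    "2022-05-18,2,10000,1100\n"
    "2022-05-13,2,20000,800\n"
    "2022-05-16,2,17000,1000\n"
)

# parse_dates converts the listed columns to datetime while reading
sample = pd.read_csv(sample_csv, parse_dates=["Posted On"])

print(sample.dtypes)
# Datetime columns expose date parts such as the month through the .dt accessor
print(sample["Posted On"].dt.month.tolist())
```

Whether the posting date is useful as a feature is a modelling decision; the conversion only makes date parts such as month or weekday easy to extract.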

Python3




#printing the shape of the dataset
df.shape


Output:

(4746, 12)

Here, ‘df.shape’ returns the dimensions of the dataframe ‘df’ as (rows, columns): the dataset has 4746 rows and 12 columns.

Python3




# Display summary information about the DataFrame 'df'
df.info()


Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64
 2   Rent               4746 non-null   int64
 3   Size               4746 non-null   int64
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB

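Here, ‘df.info()’ displays summary information about the dataframe ‘df’: the number of non-null entries in each column, the data type of each column, and the memory usage. Every column has 4746 non-null entries, so the dataset has no missing values, and eight of the twelve columns are of type object (text).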
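The same two facts, missing values and text columns, can also be obtained programmatically. This is a minimal sketch on a hypothetical three-row frame named sample (its values, including the deliberately missing entry, are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical three-row sample with the same kinds of columns as df
sample = pd.DataFrame({
    "BHK": [2, 2, 3],
    "Rent": [10000, 20000, 17000],
    "City": ["Kolkata", "Kolkata", "Mumbai"],
    "Furnishing Status": ["Unfurnished", "Semi-Furnished", None],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Non-numeric (text) columns are the candidates for categorical features
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(cat_cols)
```

A list of text columns like cat_cols is useful later, because CatBoost can be told which columns to treat as categorical.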

Python3




# Generate summary statistics of the DataFrame 'df'
print(df.describe())


Output:

               BHK          Rent         Size     Bathroom
count  4746.000000  4.746000e+03  4746.000000  4746.000000
mean      2.083860  3.499345e+04   967.490729     1.965866
std       0.832256  7.810641e+04   634.202328     0.884532
min       1.000000  1.200000e+03    10.000000     1.000000
25%       2.000000  1.000000e+04   550.000000     1.000000
50%       2.000000  1.600000e+04   850.000000     2.000000
75%       3.000000  3.300000e+04  1200.000000     2.000000
max       6.000000  3.500000e+06  8000.000000    10.000000

Here, ‘df.describe()’ computes and displays basic statistical summary information for the numeric columns in the dataframe.
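In the summary above, the mean Rent (about 34,993) is more than twice the median (16,000), and the maximum is 3,500,000, which points to a strongly right-skewed target. The sketch below shows how such a check could be written and how a log transform compresses the long tail. The five rent values are hypothetical, and applying a log transform to the target is one common option, not necessarily the approach used in the rest of this article.

```python
import numpy as np
import pandas as pd

# Hypothetical rents: four typical values plus one extreme listing
rent = pd.Series([7500, 10000, 17000, 20000, 3500000], name="Rent")

stats = rent.describe()
# A mean far above the median (the 50% row) signals a right-skewed column
is_right_skewed = stats["mean"] > stats["50%"]
print(is_right_skewed)

# log1p compresses the long right tail of the distribution
log_rent = np.log1p(rent)
# expm1 is the inverse, used to map predictions back to the original scale
restored = np.expm1(log_rent)
print(log_rent.round(2).tolist())
```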

Regression using CatBoost

In this article, we will learn about one of the state-of-the-art machine learning models: CatBoost. Here, "Cat" stands for categorical, which indicates that the algorithm is particularly effective when your data contains many categorical columns.

Table of Contents

  • What is CatBoost?
  • How CatBoost Works?
  • Implementation of Regression Using CatBoost
  • Exploratory Data Analysis
  • Data Preprocessing
  • Model Development

What is CatBoost?

CatBoost (Categorical Boosting) is a high-performance, open-source gradient-boosting framework developed by Yandex. It is designed for a wide range of machine learning tasks, including classification, regression, and ranking, with a particular emphasis on handling categorical features efficiently. CatBoost stands out for its speed, accuracy, and ease of use on structured data....

How CatBoost Works?

CatBoost is a high-performance gradient-boosting technique built for machine learning tasks on structured (tabular) data. Its core mechanism is gradient boosting, an ensemble learning technique. CatBoost starts from an initial guess, often the mean of the target variable. It then builds an ensemble of decision trees step by step, with each tree trying to reduce the errors (residuals) left by the previous ones. CatBoost stands out for how well it handles categorical features: it can process categorical columns directly, without manual encoding, and it uses a method termed “ordered boosting” to limit target leakage during training, which generally improves model performance....
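The residual-fitting loop described above can be sketched in a few lines of NumPy. This is plain gradient boosting with squared-error loss and one-split trees (stumps), written only to illustrate the idea. It is not CatBoost's implementation and does not include ordered boosting or categorical handling. The toy data, the helper names fit_stump and predict_stump, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the largest value so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    """Return the leaf value for each entry of x."""
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Hypothetical toy data: rent grows with size
size = np.array([400.0, 600.0, 800.0, 1000.0, 1200.0, 1500.0])
rent = np.array([6000.0, 8000.0, 11000.0, 15000.0, 21000.0, 30000.0])

learning_rate = 0.5
prediction = np.full_like(rent, rent.mean())  # start from the mean of the target
initial_mse = np.mean((rent - prediction) ** 2)

for _ in range(20):
    residual = rent - prediction        # errors of the current ensemble
    stump = fit_stump(size, residual)   # fit a small tree to those errors
    prediction += learning_rate * predict_stump(stump, size)  # shrink and add

final_mse = np.mean((rent - prediction) ** 2)
print(initial_mse, final_mse)
```

Each round fits a new tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added. The learning rate controls how much of each tree's correction is applied.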
