Implementation of Regression Using CatBoost
We will use the House Rent dataset to perform a regression task with the CatBoost algorithm. Before we can use the model, we first have to install the catboost package using the command below:
Installing Packages
!pip install catboost
Importing Libraries and Dataset
Python libraries make it easy to handle data and to perform both typical and complex tasks in a single line of code.
- Pandas – This library loads data into a DataFrame, a 2D tabular structure, and provides many functions for analysis tasks.
- Numpy – Numpy arrays are fast and can perform large computations in a short time.
- Matplotlib/Seaborn – These libraries are used to draw visualizations.
- Sklearn – This library contains multiple modules with pre-implemented functions for tasks ranging from data preprocessing to model development and evaluation.
Python3
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# suppress warning messages to keep the output readable
import warnings
warnings.filterwarnings('ignore')
Loading Dataset and Retrieving Information
Python3
# loading dataset
df = pd.read_csv('House_Rent_Dataset.csv')
print(df.head())
Output:
    Posted On  BHK   Rent  Size            Floor    Area Type  \
0  2022-05-18    2  10000  1100  Ground out of 2   Super Area
1  2022-05-13    2  20000   800       1 out of 3   Super Area
2  2022-05-16    2  17000  1000       1 out of 3   Super Area
3  2022-07-04    2  10000   800       1 out of 2   Super Area
4  2022-05-09    2   7500   850       1 out of 2  Carpet Area

              Area Locality     City Furnishing Status  Tenant Preferred  \
0                    Bandel  Kolkata       Unfurnished  Bachelors/Family
1  Phool Bagan, Kankurgachi  Kolkata    Semi-Furnished  Bachelors/Family
2   Salt Lake City Sector 2  Kolkata    Semi-Furnished  Bachelors/Family
3               Dumdum Park  Kolkata       Unfurnished  Bachelors/Family
4             South Dum Dum  Kolkata       Unfurnished         Bachelors

   Bathroom Point of Contact
0         2    Contact Owner
1         1    Contact Owner
2         1    Contact Owner
3         1    Contact Owner
4         1    Contact Owner
Here, we load the dataset and print its first five rows.
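If the CSV file is not at hand, the loading step can still be tried on an in-memory sample. The following is a minimal sketch, assuming a stand-in file that contains only a subset of the columns and the first three rows printed above. The variable names `csv_text` and `sample` are illustrative, and the `parse_dates` argument is an optional addition that the original code does not use.

```python
import io

import pandas as pd

# In-memory stand-in for 'House_Rent_Dataset.csv'.
# Assumption: only a few columns and the first three rows shown above.
csv_text = """Posted On,BHK,Rent,Size,Floor,Area Type,City
2022-05-18,2,10000,1100,Ground out of 2,Super Area,Kolkata
2022-05-13,2,20000,800,1 out of 3,Super Area,Kolkata
2022-05-16,2,17000,1000,1 out of 3,Super Area,Kolkata
"""

# read_csv accepts a file-like object as well as a path;
# parse_dates converts 'Posted On' from text to a datetime column.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Posted On"])

print(sample.head())
print(sample.dtypes)
```

Without `parse_dates`, the 'Posted On' column is read as plain text, which is why it appears with the object dtype when the file is loaded with the default arguments.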
Python3
# printing the shape of the dataset
df.shape
Output:
(4746, 12)
Here, ‘df.shape’ returns the dimensions of the dataframe ‘df’: 4746 rows and 12 columns.
Python3
# Display summary information about the DataFrame 'df'
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Posted On          4746 non-null   object
 1   BHK                4746 non-null   int64
 2   Rent               4746 non-null   int64
 3   Size               4746 non-null   int64
 4   Floor              4746 non-null   object
 5   Area Type          4746 non-null   object
 6   Area Locality      4746 non-null   object
 7   City               4746 non-null   object
 8   Furnishing Status  4746 non-null   object
 9   Tenant Preferred   4746 non-null   object
 10  Bathroom           4746 non-null   int64
 11  Point of Contact   4746 non-null   object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB
Here, ‘df.info()’ displays summary information about the dataframe ‘df’. It provides details such as the number of non-null entries in each column, the data types, and the memory usage. Every column has 4746 non-null entries, so the dataset contains no missing values.
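The same information can also be extracted programmatically, which is convenient when the non-numeric columns need to be collected for later use as categorical features. The following is a minimal sketch on a toy frame built from a few of the rows and columns shown earlier. The names `sample`, `missing`, `cat_cols` and `num_cols` are illustrative and do not come from the original code.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (assumption: the real
# dataframe 'df' would be used in its place).
sample = pd.DataFrame({
    "BHK": [2, 2, 2],
    "Rent": [10000, 20000, 17000],
    "City": ["Kolkata", "Kolkata", "Kolkata"],
    "Furnishing Status": ["Unfurnished", "Semi-Furnished", "Semi-Furnished"],
})

# Number of missing values in each column
missing = sample.isnull().sum()

# Split the columns by type: non-numeric columns are candidates
# for categorical features, the rest are numeric.
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()
num_cols = sample.select_dtypes(include="number").columns.tolist()

print(missing)
print("Categorical candidates:", cat_cols)
print("Numeric columns:", num_cols)
```

Applied to the full dataset, this kind of split would place the eight object columns in `cat_cols` and the four int64 columns in `num_cols`.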
Python3
# Generate summary statistics of the DataFrame 'df'
print(df.describe())
Output:
               BHK          Rent         Size     Bathroom
count  4746.000000  4.746000e+03  4746.000000  4746.000000
mean      2.083860  3.499345e+04   967.490729     1.965866
std       0.832256  7.810641e+04   634.202328     0.884532
min       1.000000  1.200000e+03    10.000000     1.000000
25%       2.000000  1.000000e+04   550.000000     1.000000
50%       2.000000  1.600000e+04   850.000000     2.000000
75%       3.000000  3.300000e+04  1200.000000     2.000000
max       6.000000  3.500000e+06  8000.000000    10.000000
Here, ‘df.describe()’ computes and displays basic statistical summary information for the numeric columns in the dataframe.
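One way to read this summary is to compare the mean with the median and the maximum with the upper quartile. The following is a small worked sketch that uses only the Rent figures from the output above. The variable names are illustrative, and the two ratios are an informal rule of thumb rather than a formal skewness test.

```python
# Values copied from the Rent column of the df.describe() output
rent_mean = 3.499345e+04
rent_median = 1.600000e+04  # the 50% row
rent_q3 = 3.300000e+04      # the 75% row
rent_max = 3.500000e+06

# A mean well above the median suggests a long right tail
mean_to_median = rent_mean / rent_median

# A maximum far above the 75th percentile points to extreme values
max_to_q3 = rent_max / rent_q3

print(f"mean / median: {mean_to_median:.2f}")
print(f"max / 75th percentile: {max_to_q3:.1f}")
```

The mean is roughly twice the median and even exceeds the 75th percentile, while the maximum is about a hundred times the 75th percentile, so Rent appears to be strongly right-skewed with a few very expensive listings. On the full dataframe, `df['Rent'].skew()` would give a direct measure.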
Regression using CatBoost
In this article, we will learn about one of the state-of-the-art machine learning models: CatBoost. Here, "cat" stands for categorical, which reflects that the algorithm is highly efficient when the data contains many categorical columns.
Table of Content
- What is CatBoost?
- How Catboost Works?
- Implementation of Regression Using CatBoost
- Exploratory Data Analysis
- Data Preprocessing
- Model Development