Data Preprocessing

Data preprocessing, the stage in which raw data is prepared for analysis and modeling, is an essential part of any data analysis or machine learning pipeline. It improves the quality and reliability of the data, which in turn improves the effectiveness of machine learning models. Let’s see how to perform it:

Log Transformation and Distribution Plot

Python3




import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
  
# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])
  
# Create a distribution plot for the transformed 'charges' column
# (distplot is removed in recent seaborn releases; histplot with kde=True
# is its replacement)
sb.histplot(df['charges'], kde=True)
  
# Display the distribution plot
plt.show()


Output:

[Distribution plot of the transformed ‘charges’ column]

In the original data, the age and bmi columns are roughly normally distributed, but charges is right-skewed. This code applies the natural logarithm transformation ‘np.log1p’ (which computes log(1 + x)) to the ‘charges’ column of the DataFrame (‘df’), which reduces the skewness and brings the values closer to a normal distribution. A distribution plot (histogram) of the transformed ‘charges’ column is then created and displayed, showing the shape of the data after the logarithmic transformation.
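The effect of the transformation can be checked numerically with the pandas ‘skew’ method. The snippet below uses synthetic lognormal data as a stand-in for the ‘charges’ column (the values are illustrative, not from the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data standing in for the 'charges' column
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9, sigma=1, size=1000))

print("skew before:", charges.skew())      # strongly positive (right-skewed)

# log1p pulls the long right tail in, moving skewness toward 0
charges_log = np.log1p(charges)
print("skew after:", charges_log.skew())   # close to 0
```

A skewness near zero after the transformation indicates a roughly symmetric, approximately normal distribution.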

Encoding Binary Categorical Columns

Python3




# Mapping Categorical to Numerical Values
  
# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
  
# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
  
# Display the DataFrame's first few rows to show the transformations
df.head()


Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.
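One caveat with ‘Series.map’: any value not found in the mapping dictionary becomes NaN, so re-running this cell after the columns are already numeric wipes them out (0 and 1 are not keys in the dictionary). A defensive sketch that guards on the column dtype, using a small toy DataFrame for illustration:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'male'],
                   'smoker': ['no', 'yes', 'no']})

# Only map while the column still holds strings, so rerunning the cell
# is harmless instead of turning every value into NaN
if df['sex'].dtype == object:
    df['sex'] = df['sex'].map({'male': 0, 'female': 1})
if df['smoker'].dtype == object:
    df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

print(df)
```

With this guard the mapping is effectively idempotent, which matters in notebook workflows where cells are often executed more than once.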

One-Hot Encoding the ‘region’ Column

Python3




# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')
  
# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)


This code applies one-hot encoding to the ‘region’ column, turning the categorical region values into binary columns, one per distinct region. Concatenating the resulting one-hot encoded columns with the original DataFrame adds a binary feature for each region.
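As an aside, ‘pd.get_dummies’ can also encode and drop the original column in a single step when the whole DataFrame is passed with the ‘columns=’ argument. Note that this variant prefixes the new columns with the source column name (e.g. ‘region_southeast’), unlike the bare names produced above. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'age': [19, 18, 28],
                   'region': ['southwest', 'southeast', 'southeast']})

# columns= encodes 'region' and removes the original column in one call;
# the dummy columns get a 'region_' prefix
df = pd.get_dummies(df, columns=['region'], dtype=int)
print(df.columns.tolist())
```

This saves the separate ‘pd.concat’ and ‘df.drop’ steps, at the cost of the longer column names.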

Python3




# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)
  
# Display the updated DataFrame
print(df.head())


Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

We one-hot encoded the ‘region’ column rather than label encoding it because it has more than two categories. Assigning ordinal codes to the regions would impose an ordering and imply that some regions are “greater” than others, which has no basis in the data.
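The problem with ordinal codes can be made concrete: under label encoding, some pairs of regions end up numerically farther apart than others, while one-hot vectors keep every pair of regions equidistant. A small illustration:

```python
import numpy as np
import pandas as pd

regions = pd.Series(['northeast', 'northwest', 'southeast', 'southwest'])

# Ordinal codes impose an artificial ordering: 'southwest' (code 3) looks
# three times farther from 'northeast' (code 0) than 'northwest' (code 1)
codes = regions.astype('category').cat.codes
print(abs(int(codes[3]) - int(codes[0])))      # 3

# One-hot vectors keep every pair of regions equally far apart
onehot = pd.get_dummies(regions).to_numpy().astype(float)
print(np.linalg.norm(onehot[3] - onehot[0]))   # sqrt(2), same for any pair
```

Since a model has no way of knowing the ordinal distances are artificial, one-hot encoding is the safer choice for nominal categories.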

Splitting Data

Python3




from sklearn.model_selection import train_test_split
  
# Define the features
features = df.drop('charges', axis=1)
  
# Define the target variable as 'charges'
target = df['charges']
  
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)
  
# Display the shapes of the training and validation sets
X_train.shape, X_val.shape


Output:

((1003, 11), (335, 11))

To evaluate the model’s performance as training proceeds, we split the dataset in a 75:25 ratio. The two parts will then be used to create LightGBM datasets and train the model.

Feature scaling

Python3




from sklearn.preprocessing import StandardScaler
  
# Standardize features
  
# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()
  
# Fit the StandardScaler to the training data only
scaler.fit(X_train)
  
# Transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)


This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.
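The key point is that ‘transform’ always uses the statistics learned from the training data, even when applied to the validation set. The snippet below verifies this against a manual computation, using random toy arrays in place of the real X_train / X_val:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real X_train / X_val
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5, scale=2, size=(100, 3))
X_val = rng.normal(loc=5, scale=2, size=(20, 3))

scaler = StandardScaler().fit(X_train)

# transform() subtracts the *training* mean and divides by the *training*
# standard deviation, even when applied to the validation data
manual = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(scaler.transform(X_val), manual))  # True
```

Fitting the scaler on the validation data as well would leak information from the validation set into preprocessing, which is why only ‘fit’ on X_train and ‘transform’ on both is correct.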

Dataset Preparation

Python3




import lightgbm as lgb
  
# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)
  
# Create a LightGBM dataset for testing with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)


Using the training and validation splits, we create LightGBM dataset objects with lgb.Dataset. These objects package the features and labels in the format LightGBM expects for training and evaluation; passing reference=train_data ensures the validation set is binned consistently with the training set.

Regression using LightGBM

In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or Light Gradient Boosting Machine. XGBoost (eXtreme Gradient Boosting) came from repeatedly improving on earlier gradient boosting models, but with LightGBM we can achieve similar or better results with far less computation and train on even bigger datasets in less time. Let’s see what LightGBM is and how we can perform regression using it.

Table of Content

  • What is LightGBM?
  • How LightGBM Works?
  • Implementation of LightGBM
  • Exploratory Data Analysis
  • Data Preprocessing
  • Regression Model using LightGBM
  • Conclusion

What is LightGBM?

...

How LightGBM Works?

LightGBM, or ‘Light Gradient Boosting Machine’, is an open-source, high-performance gradient boosting framework designed for efficient and scalable machine learning tasks. It is specially tailored for speed and accuracy, making it a popular choice for both structured and unstructured data in diverse domains....

Implementation of LightGBM

LightGBM grows its decision trees leaf-wise, which means that at each step only one leaf is split: the one with the largest gain. Leaf-wise trees can overfit, especially on smaller datasets; this can be prevented by limiting the tree depth. LightGBM also buckets continuous feature values into histogram bins. Instead of iterating over every data point, it iterates over the bins to calculate gains and choose splits, and this optimization also benefits sparse datasets. Another element of LightGBM is exclusive feature bundling, in which the algorithm combines mutually exclusive features to reduce dimensionality and speed up processing....
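The histogram trick described above can be sketched in plain NumPy: quantize a feature into a fixed number of bins, then accumulate gradient statistics per bin so that split gains are evaluated over ~255 candidate thresholds instead of every sample. This is a simplified illustration, not LightGBM’s actual implementation (which builds smarter, quantile-aware bins):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)   # one feature column
g = rng.normal(size=10_000)   # per-sample gradients

n_bins = 255
# Map each value to a bin index over equally spaced edges
edges = np.linspace(x.min(), x.max(), n_bins + 1)
bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Accumulate gradient sums and sample counts per bin: split gains can now
# be evaluated over n_bins thresholds instead of 10,000 individual samples
grad_hist = np.bincount(bins, weights=g, minlength=n_bins)
cnt_hist = np.bincount(bins, minlength=n_bins)
print(cnt_hist.sum())  # 10000: every sample landed in exactly one bin
```

Because the histograms preserve the total gradient sum and sample count, the gain of any bin-boundary split can be computed from prefix sums over the histogram, which is where the speedup comes from.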

Exploratory Data Analysis

In this article, we will use this dataset to perform a regression task using the lightGBM algorithm. But to use the LightGBM model we will first have to install the lightGBM model using the below command:...

Data Preprocessing

...

Regression Model using LightGBM

...

Conclusion

...