Data Preprocessing

Data preprocessing, the stage in which raw data is prepared for analysis and modeling, is an essential part of any data analysis or machine learning pipeline. It improves the quality and reliability of the data, which in turn improves the effectiveness of machine learning models. Let’s see how to perform it:

Log Transformation and Distribution Plot

Python3




import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
  
# Apply the natural logarithm transformation to the 'charges' column
df['charges'] = np.log1p(df['charges'])
  
# Create a distribution plot for the transformed 'charges' column
# (distplot is removed in recent seaborn releases; histplot with kde=True
# is its replacement)
sb.histplot(df['charges'], kde=True)
  
# Display the distribution plot
plt.show()


Output:

[Distribution plot of the transformed ‘charges’ column]

In the original data, the age and bmi columns are roughly normally distributed, but charges is right-skewed. This code applies the natural logarithm transformation ‘np.log1p’ (which computes log(1 + x)) to the ‘charges’ column of the DataFrame (‘df’), which reduces the skewness and brings the values closer to a normal distribution. A distribution plot (histogram) of the transformed ‘charges’ column is then created and displayed, showing the shape of the data after the logarithmic transformation.
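The effect of the transformation can be checked numerically with the pandas ‘skew’ method. The snippet below uses synthetic lognormal data as a stand-in for the ‘charges’ column (the values are illustrative, not from the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data standing in for the 'charges' column
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9, sigma=1, size=1000))

print("skew before:", charges.skew())      # strongly positive (right-skewed)

# log1p pulls the long right tail in, moving skewness toward 0
charges_log = np.log1p(charges)
print("skew after:", charges_log.skew())   # close to 0
```

A skewness near zero after the transformation indicates a roughly symmetric, approximately normal distribution.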

Encoding Binary Categorical Columns

Python3




# Mapping Categorical to Numerical Values
  
# Map 'sex' column values ('male' to 0, 'female' to 1)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
  
# Map 'smoker' column values ('no' to 0, 'yes' to 1)
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
  
# Display the DataFrame's first few rows to show the transformations
df.head()


Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

This code performs categorical-to-numerical mapping for the ‘sex’ and ‘smoker’ columns, making the data suitable for machine learning algorithms that require numerical input. It also displays the initial rows of the DataFrame to illustrate the changes.
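One caveat with ‘Series.map’: any value not found in the mapping dictionary becomes NaN, so re-running this cell after the columns are already numeric wipes them out (0 and 1 are not keys in the dictionary). A defensive sketch that guards on the column dtype, using a small toy DataFrame for illustration:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'male'],
                   'smoker': ['no', 'yes', 'no']})

# Only map while the column still holds strings, so rerunning the cell
# is harmless instead of turning every value into NaN
if df['sex'].dtype == object:
    df['sex'] = df['sex'].map({'male': 0, 'female': 1})
if df['smoker'].dtype == object:
    df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

print(df)
```

With this guard the mapping is effectively idempotent, which matters in notebook workflows where cells are often executed more than once.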

One-Hot Encoding the ‘region’ Column

Python3




# Perform one-hot encoding on the 'region' column
temp = pd.get_dummies(df['region']).astype('int')
  
# Concatenate the one-hot encoded columns with the original DataFrame
df = pd.concat([df, temp], axis=1)


This code applies one-hot encoding to the ‘region’ column, turning the categorical region values into binary columns, one per distinct region. Concatenating the resulting one-hot encoded columns with the original DataFrame adds a binary feature for each region.
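As an aside, ‘pd.get_dummies’ can also encode and drop the original column in a single step when the whole DataFrame is passed with the ‘columns=’ argument. Note that this variant prefixes the new columns with the source column name (e.g. ‘region_southeast’), unlike the bare names produced above. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'age': [19, 18, 28],
                   'region': ['southwest', 'southeast', 'southeast']})

# columns= encodes 'region' and removes the original column in one call;
# the dummy columns get a 'region_' prefix
df = pd.get_dummies(df, columns=['region'], dtype=int)
print(df.columns.tolist())
```

This saves the separate ‘pd.concat’ and ‘df.drop’ steps, at the cost of the longer column names.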

Python3




# Remove 'Id' and 'region' columns from the DataFrame
df.drop(['Id', 'region'], inplace=True, axis=1)
  
# Display the updated DataFrame
print(df.head())


Output:

   age  sex     bmi  children  smoker   charges  northeast  northwest  \
0   19    1  27.900         0       1  9.734236          0          0
1   18    0  33.770         1       0  7.453882          0          0
2   28    0  33.000         3       0  8.400763          0          0
3   33    0  22.705         0       0  9.998137          0          1
4   32    0  28.880         0       0  8.260455          0          1

   southeast  southwest
0          0          1
1          1          0
2          1          0
3          0          0
4          0          0

We one-hot encoded the ‘region’ column rather than label encoding it because it has more than two categories. Assigning ordinal codes to the regions would impose an ordering and imply that some regions are “greater” than others, which has no basis in the data.
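The problem with ordinal codes can be made concrete: under label encoding, some pairs of regions end up numerically farther apart than others, while one-hot vectors keep every pair of regions equidistant. A small illustration:

```python
import numpy as np
import pandas as pd

regions = pd.Series(['northeast', 'northwest', 'southeast', 'southwest'])

# Ordinal codes impose an artificial ordering: 'southwest' (code 3) looks
# three times farther from 'northeast' (code 0) than 'northwest' (code 1)
codes = regions.astype('category').cat.codes
print(abs(int(codes[3]) - int(codes[0])))      # 3

# One-hot vectors keep every pair of regions equally far apart
onehot = pd.get_dummies(regions).to_numpy().astype(float)
print(np.linalg.norm(onehot[3] - onehot[0]))   # sqrt(2), same for any pair
```

Since a model has no way of knowing the ordinal distances are artificial, one-hot encoding is the safer choice for nominal categories.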

Splitting Data

Python3




from sklearn.model_selection import train_test_split
  
# Define the features
features = df.drop('charges', axis=1)
  
# Define the target variable as 'charges'
target = df['charges']
  
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  random_state=2023,
                                                  test_size=0.25)
  
# Display the shapes of the training and validation sets
X_train.shape, X_val.shape


Output:

((1003, 11), (335, 11))

To evaluate the model’s performance as training proceeds, we split the dataset in a 75:25 ratio. The two parts will then be used to create LightGBM datasets and train the model.

Feature scaling

Python3




from sklearn.preprocessing import StandardScaler
  
# Standardize features
  
# Use StandardScaler to scale the training and validation data
scaler = StandardScaler()
  
# Fit the StandardScaler to the training data only
scaler.fit(X_train)
  
# Transform both the training and validation data
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)


This code fits the StandardScaler to the training data to calculate the mean and standard deviation and then transforms both the training and validation data using these calculated values to ensure consistent scaling between the two datasets.
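The key point is that ‘transform’ always uses the statistics learned from the training data, even when applied to the validation set. The snippet below verifies this against a manual computation, using random toy arrays in place of the real X_train / X_val:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real X_train / X_val
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5, scale=2, size=(100, 3))
X_val = rng.normal(loc=5, scale=2, size=(20, 3))

scaler = StandardScaler().fit(X_train)

# transform() subtracts the *training* mean and divides by the *training*
# standard deviation, even when applied to the validation data
manual = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(scaler.transform(X_val), manual))  # True
```

Fitting the scaler on the validation data as well would leak information from the validation set into preprocessing, which is why only ‘fit’ on X_train and ‘transform’ on both is correct.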

Dataset Preparation

Python3




import lightgbm as lgb
  
# Create a LightGBM dataset for training with features X_train and labels Y_train
train_data = lgb.Dataset(X_train, label=Y_train)
  
# Create a LightGBM dataset for testing with features X_val and labels Y_val,
# and specify the reference dataset as train_data for consistent evaluation
test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)


Using the training and validation splits, we create LightGBM dataset objects with lgb.Dataset. These objects package the features and labels in the format LightGBM expects for training and evaluation; passing reference=train_data ensures the validation set is binned consistently with the training set.

Regression using LightGBM

In this article, we will learn about one of the state-of-the-art machine learning models: LightGBM, or Light Gradient Boosting Machine. XGBoost (eXtreme Gradient Boosting) came from repeatedly improving on earlier gradient boosting models, but with LightGBM we can achieve similar or better results with far less computation and train on even bigger datasets in less time. Let’s see what LightGBM is and how we can perform regression using it.

Table of Content

  • What is LightGBM?
  • How LightGBM Works?
  • Implementation of LightGBM
  • Exploratory Data Analysis
  • Data Preprocessing
  • Regression Model using LightGBM
  • Conclusion

What is LightGBM?

...

How LightGBM Works?

LightGBM, or ‘Light Gradient Boosting Machine’, is an open-source, high-performance gradient boosting framework designed for efficient and scalable machine learning tasks. It is specially tailored for speed and accuracy, making it a popular choice for both structured and unstructured data in diverse domains....

Implementation of LightGBM

LightGBM grows its decision trees leaf-wise, which means that at each step only one leaf is split: the one with the largest gain. Leaf-wise trees can overfit, especially on smaller datasets; this can be prevented by limiting the tree depth. LightGBM also buckets continuous feature values into histogram bins. Instead of iterating over every data point, it iterates over the bins to calculate gains and choose splits, and this optimization also benefits sparse datasets. Another element of LightGBM is exclusive feature bundling, in which the algorithm combines mutually exclusive features to reduce dimensionality and speed up processing....
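The histogram trick described above can be sketched in plain NumPy: quantize a feature into a fixed number of bins, then accumulate gradient statistics per bin so that split gains are evaluated over ~255 candidate thresholds instead of every sample. This is a simplified illustration, not LightGBM’s actual implementation (which builds smarter, quantile-aware bins):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)   # one feature column
g = rng.normal(size=10_000)   # per-sample gradients

n_bins = 255
# Map each value to a bin index over equally spaced edges
edges = np.linspace(x.min(), x.max(), n_bins + 1)
bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Accumulate gradient sums and sample counts per bin: split gains can now
# be evaluated over n_bins thresholds instead of 10,000 individual samples
grad_hist = np.bincount(bins, weights=g, minlength=n_bins)
cnt_hist = np.bincount(bins, minlength=n_bins)
print(cnt_hist.sum())  # 10000: every sample landed in exactly one bin
```

Because the histograms preserve the total gradient sum and sample count, the gain of any bin-boundary split can be computed from prefix sums over the histogram, which is where the speedup comes from.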

Exploratory Data Analysis

In this article, we will use this dataset to perform a regression task using the lightGBM algorithm. But to use the LightGBM model we will first have to install the lightGBM model using the below command:...

Data Preprocessing

...

Regression Model using LightGBM

...

Conclusion

...