Visualizing LightGBM Feature Importance
First, make sure you have LightGBM installed:
! pip install lightgbm
Let’s break down the provided code step by step:
Step 1: Import Libraries
In this step, we import the necessary libraries that the code will use:
- lightgbm for building the gradient boosting model
- matplotlib.pyplot for creating plots
- sklearn.datasets to load the breast cancer dataset for classification
- train_test_split, numpy and pandas to perform data preprocessing
Python3
# Importing Necessary Libraries
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
Step 2: Create a LightGBM Dataset
Here, the breast cancer dataset is loaded into a DataFrame, split into training and test sets, and the training portion is wrapped in a LightGBM dataset named train_data. This dataset is specifically formatted for training the LightGBM model. It is constructed using the following inputs:
- X_train: the training feature data (i.e., the independent variables).
- y_train: the corresponding target labels (i.e., the dependent variable or the values you want to predict).
Python3
# Loading the Breast Cancer Dataset
cancer = load_breast_cancer()

# Creating dataframe
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))

## Features
X = df.drop(['target'], axis=1)
## Target
y = df['target']

# Splitting the dataset in test and train datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Creating the LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train)
Step 3: Define Model Parameters
In this step, a dictionary named `params` is defined. It holds the configuration parameters used to set up the LightGBM model. Here’s what each parameter means:
- objective specifies the learning task and the loss function that training minimizes. "binary" selects binary classification.
- metric specifies the metric used to evaluate the model. "binary_logloss" is the log loss for binary classification.
- boosting_type indicates the boosting type to be used in LightGBM. gbdt stands for Gradient Boosting Decision Trees, one of the boosting methods available in LightGBM.
- learning_rate is the shrinkage rate that scales the contribution of each new tree. Here it is set to 0.1.
These parameters define how the model will be trained and evaluated.
Python3
# Define parameters for the model
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "boosting_type": "gbdt",
    "learning_rate": 0.1
}
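To make the binary_logloss metric named in params more concrete, here is a minimal sketch of the standard binary log loss formula in plain Python. The labels and probabilities are hypothetical values chosen for illustration, and the helper name binary_logloss is our own, not a LightGBM function. The assumption is that LightGBM's metric follows this standard definition up to numerical details.

```python
import math

def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities, for illustration only
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]
uncertain = [0.5, 0.5, 0.5, 0.5]

loss_confident = binary_logloss(labels, confident)
loss_uncertain = binary_logloss(labels, uncertain)
print(f"confident: {loss_confident:.4f}, uncertain: {loss_uncertain:.4f}")
```

Confident, correct predictions give a lower loss than predicting 0.5 for every sample, which is why a smaller binary_logloss indicates a better-calibrated classifier.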
Step 4: Train the LightGBM Model
In this step, the LightGBM model is trained using the lgb.train function. Here’s what’s happening:
- params, the model configuration defined earlier, is passed as the first argument.
- train_data, the LightGBM training dataset, is passed as the second argument.
- num_boost_round=5 specifies the number of boosting rounds or iterations during training. The model is trained for 5 rounds, and each round involves adding a decision tree to the ensemble.
After this step, the model variable contains the trained LightGBM model.
Python3
# Train the LightGBM model
model = lgb.train(params, train_data, num_boost_round=5)
Output:
[LightGBM] [Info] Number of positive: 249, number of negative: 149
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000248 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3978
[LightGBM] [Info] Number of data points in the train set: 398, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.625628 -> initscore=0.513507
[LightGBM] [Info] Start training from score 0.513507
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
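The BoostFromScore line in the log above can be checked by hand. The sketch below assumes that initscore is the log-odds of the positive-class fraction pavg. This is an inference from the numbers in the log, not a statement about LightGBM's internals, but it appears to reproduce both logged values.

```python
import math

# Class counts reported in the training log above
n_pos, n_neg = 249, 149

# Fraction of positive samples in the training set
pavg = n_pos / (n_pos + n_neg)

# Log-odds of the positive class (assumed to be how initscore is derived)
init_score = math.log(pavg / (1 - pavg))

print(f"pavg={pavg:.6f} -> initscore={init_score:.6f}")
```

Under this reading, boosting starts from a constant prediction equal to the base rate of the positive class, and each round adjusts that starting score.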
Step 5: Plot Feature Importance
Finally, the code visualizes the feature importance using the lgb.plot_importance function and Matplotlib. Here’s what each part of this step does:
- lgb.plot_importance(model, importance_type="gain", figsize=(7, 6), title="LightGBM Feature Importance (Gain)") generates a feature importance plot from the trained LightGBM model. With importance_type="gain", a feature’s importance is the total gain (the reduction in the training loss) contributed by all splits that use that feature. The call also sets the figure size and provides a title for the plot.
- lgb.plot_importance(model, importance_type="split", figsize=(7, 6), title="LightGBM Feature Importance (Split)") creates a feature importance plot based on the "split" metric. This metric counts how many times a feature is used to split the data across the decision trees in the model, which helps assess the feature’s importance in making decisions.
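The difference between the two importance types can be illustrated with a toy aggregation. The split records below are hypothetical values made up for illustration and do not come from the trained model. The sketch only shows how counting splits and summing gains can rank features differently.

```python
from collections import defaultdict

# Hypothetical split records (feature name, gain of that split), for illustration only
splits = [
    ("worst radius", 120.0),
    ("mean texture", 4.0),
    ("mean texture", 3.5),
    ("mean texture", 2.5),
    ("worst concave points", 60.0),
]

split_importance = defaultdict(int)    # how many times each feature was used
gain_importance = defaultdict(float)   # total gain contributed by each feature

for feature, gain in splits:
    split_importance[feature] += 1
    gain_importance[feature] += gain

top_by_split = max(split_importance, key=split_importance.get)
top_by_gain = max(gain_importance, key=gain_importance.get)
print("Top by split:", top_by_split)
print("Top by gain:", top_by_gain)
```

In this toy data, one feature is used often for small refinements while another is used once for a large improvement, so the two rankings disagree. It is therefore worth looking at both plots.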
Plot feature importance using Gain
Python3
# Plot feature importance using Gain
lgb.plot_importance(model, importance_type="gain", figsize=(7, 6),
                    title="LightGBM Feature Importance (Gain)")
plt.show()
Output:
Plot feature importance using Split
Python3
# Plot feature importance using Split
lgb.plot_importance(model, importance_type="split", figsize=(7, 6),
                    title="LightGBM Feature Importance (Split)")
plt.show()
Output:
The resulting plot provides insights into which features were most influential in the LightGBM model’s predictions, helping in feature selection and model interpretation.
The code demonstrates the complete process of importing libraries, preparing a LightGBM dataset, defining model parameters, training a LightGBM binary classification model, and visualizing feature importance using both the “gain” and “split” methods.
LightGBM Feature Importance and Visualization
When it comes to machine learning, model performance depends heavily on feature selection and on understanding the significance of each feature. LightGBM, an efficient gradient-boosting framework developed by Microsoft, has gained popularity for its speed and accuracy across a variety of machine-learning tasks. Its speed and memory efficiency on large datasets make it a practical choice in industries such as finance, e-commerce, and healthcare, where massive datasets require swift analysis.