Feature Importance in Random Forests: Implementation

To show implementation, The iris dataset is used throughout the article to understand the implementation of feature importance.

Prerequisities: Install necessary libraries

!pip install shap

Python3

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import shap
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.25, random_state=42) # Split dataset into 75% train and 25% test
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

Method 1: Built-in feature importance with Scikit Learn

Gini Importance: The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. The criterion is the Gini impurity, which measures the impurity of a node in a decision tree, with more substantial weight to the most important features. Therefore, Gini importance is also known as the total decrease in node impurity.

Gini Importance extends this concept to evaluate the contribution of each feature across multiple trees. When building a Random Forest, the algorithm constructs an ensemble of decision trees by repeatedly sampling the dataset and creating diverse subsets. For each tree, Gini Importance quantifies the reduction in Gini impurity achieved by splitting nodes based on a particular feature. Features that consistently lead to greater impurity reduction across the ensemble are assigned higher Gini Importance scores.

The implementation code below calculates and displays the Gini Importance of features in a machine learning model (clf).

Python3

# Built-in feature importance (Gini Importance)
importances = clf.feature_importances_
feature_imp_df = pd.DataFrame({'Feature': feature_names, 'Gini Importance': importances}).sort_values('Gini Importance', ascending=False) 
print(feature_imp_df)

Output:

             Feature  Gini Importance
2  petal length (cm)         0.437635
3   petal width (cm)         0.423137
0  sepal length (cm)         0.109683
1   sepal width (cm)         0.029545

Visualization plot for feature importance

Python3

# Create a bar plot for feature importance
plt.figure(figsize=(8, 4))
plt.barh(feature_names, importances, color='skyblue')
plt.xlabel('Gini Importance')
plt.title('Feature Importance - Gini Importance')
plt.gca().invert_yaxis()  # Invert y-axis for better visualization
plt.show()

Output:

Mean Decrease Accuracy: In the Mean Decrease Accuracy (MDA) measures the average reduction in model accuracy across all out-of-bag cross-validated predictions when a particular feature undergoes permutation after training but before making predictions.

It offers a comprehensive evaluation of a feature’s significance in influencing the model’s performance. MDA inherently adopts a global perspective, taking into account the entire dataset to gauge the importance of individual features.

The code Iterates through each feature in the dataset. For each feature, it creates a copy of the test set and shuffles the values of that specific feature.
Calculating the accuracy of the classifier on the modified test set and comparing the shuffled accuracy with the initial accuracy to determine the decrease in accuracy caused by shuffling that specific feature.

Python3

# Mean Decrease Accuracy
importances = []
initial_accuracy = accuracy_score(y_test, clf.predict(X_test)) # initial accuracy
for i in range(X.shape[1]):
    X_test_copy = X_test.copy()
    np.random.shuffle(X_test_copy[:, i])
    shuff_accuracy = accuracy_score(y_test, clf.predict(X_test_copy))
    importances.append(initial_accuracy - shuff_accuracy)
    
accuracy_df = pd.DataFrame({'Feature': feature_names, 'Decrease in Accuracy': importances}).sort_values('Decrease in Accuracy', ascending=False)
print(accuracy_df)

Output:

             Feature  Decrease in Accuracy
2  petal length (cm)              0.263158
3   petal width (cm)              0.157895
0  sepal length (cm)              0.000000
1   sepal width (cm)              0.000000

Visualization plot for feature importance

Python3

# Create a bar plot for feature importance
plt.figure(figsize=(8, 4))
plt.barh(feature_names, importances, color='skyblue')
plt.xlabel('Mean Decrease Accuracy')
plt.title('Feature Importance - Mean Decrease Accuracy')
plt.gca().invert_yaxis() 
plt.show()

Output:

Method 2: Permutation feature importance

Permutation importance assesses the significance of each feature independently in out-of-bag cross-validated predictions, providing a more reliable and robust measure.

By evaluating the impact of individual feature permutations on predictions, it calculates importance. This method involves randomly shuffling a single feature’s values and measuring the resulting performance decrease (e.g., accuracy, F1-score). Comparing the model’s original and permuted values allows for a precise evaluation of feature importance.

The code uses the permutation_importance function to calculate permutation feature importance for each feature in a trained classifier, where:

n_repeats: Specifies how many times each feature is permuted to calculate the importance.
n_jobs: Determines the number of parallel jobs to run. Setting it to -1 utilizes all available CPU cores, improving computation speed.

Python3

# Permutation feature importance
from sklearn.inspection import permutation_importance
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1)
perm_imp_df = pd.DataFrame({'Feature': feature_names, 'Permutation Importance': result.importances_mean}).sort_values('Permutation Importance', ascending=False)
print(perm_imp_df)

Output:

             Feature  Permutation Importance
2  petal length (cm)                0.210526
3   petal width (cm)                0.144737
0  sepal length (cm)                0.000000
1   sepal width (cm)                0.000000

Here’s what the permutation importance values suggest in this output:

“Petal length (cm)” has the highest permutation importance value (0.210526), indicating that shuffling the values of this feature leads to a significant decrease in the model’s performance.
“Petal width (cm)” also has a relatively high permutation importance value (0.144737), though lower than “Petal length (cm)”.
“Sepal length (cm)” and “Sepal width (cm)” have permutation importance values of 0, suggesting that shuffling the values of these features does not affect the model’s performance significantly.

Visualization plot for feature importance

Python3

plt.figure(figsize=(6, 6))
plt.bar(perm_imp_df['Feature'], perm_imp_df['Permutation Importance'])
plt.xlabel('Feature')
plt.ylabel('Permutation Importance')
plt.title('Permutation Feature Importance')
plt.xticks(rotation=45, ha='right')  # Adjust rotation for better visibility
plt.tight_layout()
plt.show()

Output:

Method 3: SHAP values

SHAP (SHapley Additive exPlanations) value is a measure of the contribution of a feature towards the prediction for each instance.

For a given prediction made by a Random Forest model, SHAP values can be calculated for each feature for that specific instance.
It involves considering all possible subsets of features and their interactions, measuring the impact of including or excluding each feature on the prediction.

Positive SHAP values indicate a positive contribution to the prediction, while negative values suggest a negative contribution. The magnitude of the SHAP value represents the strength of the contribution.

Python3

shap_summary = np.abs(shap_values).mean(axis=0) 
shap_summary_df = pd.DataFrame({'Feature': feature_names[:-1], 'SHAP values': shap_summary})
shap_summary_df = shap_summary_df.sort_values('SHAP values', ascending=False)
print(shap_summary_df)

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(shap_summary_df['Feature'], shap_summary_df['SHAP values'], color='skyblue')
plt.xlabel('Mean Absolute SHAP Value')
plt.ylabel('Feature')
plt.title('Feature Importance based on SHAP Values')
plt.gca().invert_yaxis() 
plt.show()

Output:

             Feature  SHAP values
0  sepal length (cm)     0.780402
1   sepal width (cm)     0.604297
2  petal length (cm)     0.292078

Shap Values

Feature Importance in Random Forests: Implementation

Method 1: Built-in feature importance with Scikit Learn

Visualization plot for feature importance

Visualization plot for feature importance

Method 2: Permutation feature importance

Visualization plot for feature importance

Method 3: SHAP values

Feature Importance with Random Forests

Similar Reads