Stepwise Guide to Select the Best Features

Implementation on a real-life dataset

Importing necessary libraries


import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Loading and Splitting Dataset

  • The X variable contains the features (sepal length, sepal width, petal length, petal width) of the flowers, and the y variable contains the target labels (species of iris).
  • The train_test_split function is used to split the data, where 70% of the data is used for training (X_train, y_train) and 30% is used for testing (X_test, y_test).
  • The random_state parameter is set to 42 to ensure reproducibility of the split.


# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
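A quick sanity check of the split can be sketched as follows. It repeats the loading step so that it runs on its own, and the printed shapes assume the standard 150-sample Iris dataset bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Column order of X; the index positions matter later when reading selector output
print(iris.feature_names)
# 70% of the 150 samples go to training, 30% to testing
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```

The feature order printed here (sepal length, sepal width, petal length, petal width) is what the integer indices returned by the selectors refer to.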

1. Correlation for feature selection

Using correlation is a common technique to identify relationships between features in a dataset. It helps in understanding which features are closely related and can potentially be redundant. Here’s how you can use correlation:

  1. Calculate Correlation Matrix: Compute the correlation matrix of your dataset. This matrix shows the correlation coefficient between each pair of features.
  2. Visualize Correlation Matrix: Create a heatmap of the correlation matrix to visualize the relationships between features. A higher correlation coefficient (closer to 1 or -1) indicates a stronger relationship.
  3. Identify Important Features: Look for features that are highly correlated with the target variable (if it’s a supervised learning problem) or with other features. Features with high correlation with the target are often good predictors, while features with high correlation with other features may be redundant.
  4. Select Features: Based on the correlation matrix, you can decide to keep, drop, or further investigate certain features. High correlations between features may indicate multicollinearity, which can adversely affect some models (like linear regression).


# Build a DataFrame holding the four features plus the target column
iris_df = load_iris(as_frame=True).frame
# Calculate the correlation matrix
iris_corr_matrix = iris_df.corr()
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(iris_corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix for Iris Dataset')
plt.show()


Correlation Matrix

Here, we see that the target is most strongly correlated with petal length and petal width, so these two are natural candidates if we want to keep only the most important features.

If you were to perform feature selection or feature importance analysis on this dataset, selecting petal length and petal width as the most important features could be a reasonable approach based on their high correlation with the target variable. These features provide valuable information for distinguishing between different species of iris.
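The ranking read off the heatmap can also be computed directly. The following is a minimal sketch, not a definitive recipe: it treats the integer-coded species label as a numeric target for Pearson correlation, and the cutoff top_k = 2 is a hypothetical choice made for illustration.

```python
from sklearn.datasets import load_iris

# DataFrame with the four feature columns plus a "target" column
iris_df = load_iris(as_frame=True).frame

# Absolute Pearson correlation of each feature with the integer-coded target
target_corr = iris_df.corr()["target"].drop("target").abs()
ranked = target_corr.sort_values(ascending=False)
print(ranked)

top_k = 2  # hypothetical cutoff, chosen to match the discussion above
selected = ranked.index[:top_k].tolist()
print("Selected:", selected)
```

Taking the absolute value matters because a strongly negative correlation is just as informative as a strongly positive one.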

2. Filter Methods:

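Here’s a brief definition of each of the methods used below. Strictly speaking, only SelectKBest and SelectPercentile are filter methods; RFE is a wrapper method, and SelectFromModel and Random Forest importances are embedded (model-based) methods.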

  1. SelectKBest: Selects the k highest-scoring features based on a scoring function (e.g., ANOVA F-value, chi-squared statistic) that measures the relevance of each feature to the target variable.
  2. SelectPercentile: Selects the top features based on a percentile of the highest scores. For example, if you set the percentile to 10, it will select the top 10% of features.
  3. RFE (Recursive Feature Elimination): Recursively removes the least important features and fits the model again until the desired number of features is reached. It uses the model’s feature importance (or coefficient) to determine which features to eliminate.
  4. SelectFromModel: Selects features based on a threshold of importance weights computed by an estimator (e.g., decision trees, linear models). It keeps only the features whose importance exceeds the threshold.
  5. Random Forest correlations: Despite the name, this relies on feature importances rather than correlations. It fits a Random Forest model and keeps the features whose importance exceeds a chosen cutoff.

A. SelectKBest:

SelectKBest selects the top k features based on a scoring function. In this case, k=2 is used to select the top 2 features. Because no score_func is passed, scikit-learn's default, f_classif (the ANOVA F-value), is used.


# SelectKBest
selector_kbest = SelectKBest(k=2) # Select top 2 features
X_train_kbest = selector_kbest.fit_transform(X_train, y_train)
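To see why two particular features are kept, it helps to inspect the scores the selector computed. This sketch repeats the data split so it runs on its own and passes f_classif explicitly, which is the same scoring function SelectKBest uses by default.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

selector_kbest = SelectKBest(score_func=f_classif, k=2)
selector_kbest.fit(X_train, y_train)

# One ANOVA F-value per feature; higher means the class means differ more
for name, score in zip(iris.feature_names, selector_kbest.scores_):
    print(f"{name}: F = {score:.1f}")

# Translate the kept column indices into readable names
kbest_names = [iris.feature_names[i] for i in selector_kbest.get_support(indices=True)]
print("Kept:", kbest_names)
```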

B. SelectPercentile:

Selects the top features based on a percentile of the highest scores. Here, percentile=50 is used to select the top 50% of features, which for the four Iris features means two.


# SelectPercentile
selector_percentile = SelectPercentile(percentile=50) # Select top 50% features
X_train_percentile = selector_percentile.fit_transform(X_train, y_train)
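Once a selector is fitted on the training data, the same column choice should be applied to the test data with transform rather than by fitting again. A minimal sketch, with the split repeated so the block is self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

selector_percentile = SelectPercentile(percentile=50)
X_train_percentile = selector_percentile.fit_transform(X_train, y_train)

# Reuse the fitted selector on the test set; refitting would leak test labels
X_test_percentile = selector_percentile.transform(X_test)

print(X_train_percentile.shape, X_test_percentile.shape)
```

Both arrays end up with two columns, so a model trained on the reduced training set can be evaluated on the reduced test set.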

C. RFE (Recursive Feature Elimination):

Recursively removes the least important features until the desired number of features is reached. n_features_to_select=2 is used to select the top 2 features.


# RFE (Recursive Feature Elimination)
estimator = LogisticRegression(max_iter=1000) # Base estimator
selector_rfe = RFE(estimator=estimator, n_features_to_select=2) # Select top 2 features
X_train_rfe = selector_rfe.fit_transform(X_train, y_train)
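RFE also records the order in which features were eliminated, which can be more informative than the final selection alone. The sketch below prints the ranking_ and support_ attributes of the fitted selector; the split is repeated so the block runs independently.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

selector_rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector_rfe.fit(X_train, y_train)

# ranking_ is 1 for kept features; larger numbers were eliminated earlier
for name, rank, kept in zip(
    iris.feature_names, selector_rfe.ranking_, selector_rfe.support_
):
    print(f"{name}: rank={rank}, kept={kept}")
```

With the default step of 1, one feature is dropped per round, so the two discarded features receive ranks 2 and 3.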

D. SelectFromModel:

Selects features based on a threshold of importance weights computed by an estimator. threshold='mean' is used to select features with importance greater than the mean importance.


# SelectFromModel
estimator = RandomForestClassifier() # Base estimator
selector_model = SelectFromModel(estimator=estimator, threshold='mean') # Select features with importance greater than mean
X_train_model = selector_model.fit_transform(X_train, y_train)

E. Random Forest correlations:

Uses feature importances from a Random Forest model to select features directly. Features with importance greater than threshold=0.1 are selected.


# Random Forest correlations
model = RandomForestClassifier().fit(X_train, y_train) # Fit the forest on the training data
importances = model.feature_importances_
threshold = 0.1
X_train_rf_corr = X_train[:, importances > threshold]
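The cutoff that SelectFromModel derives from threshold='mean' can be inspected after fitting. This is a sketch under one assumption not in the original code: random_state=0 is added to the forest so that repeated runs give the same importances.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# random_state=0 is an assumption added here for reproducibility
selector_model = SelectFromModel(
    estimator=RandomForestClassifier(random_state=0), threshold="mean"
)
selector_model.fit(X_train, y_train)

# The fitted forest is exposed as estimator_
importances = selector_model.estimator_.feature_importances_
for name, imp in zip(iris.feature_names, importances):
    print(f"{name}: {imp:.3f}")

# Importances sum to 1, so with four features the mean cutoff is 0.25
print("Cutoff used:", selector_model.threshold_)
print("Kept indices:", selector_model.get_support(indices=True))
```

Compared with the fixed cutoff of 0.1, the mean-based cutoff adapts to the number of features, since the importances always sum to 1.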

Comparing the selected features by different methods


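# Display selected features for each method
print("Selected features using SelectKBest:")
print(selector_kbest.get_support(indices=True))
print("\nSelected features using SelectPercentile:")
print(selector_percentile.get_support(indices=True))
print("\nSelected features using RFE:")
print(selector_rfe.get_support(indices=True))
print("\nSelected features using SelectFromModel:")
print(selector_model.get_support(indices=True))
print("\nSelected features using Random Forest correlations:")
print([i for i in range(len(importances)) if importances[i] > threshold])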


Selected features using SelectKBest:
[2 3]

Selected features using SelectPercentile:
[2 3]

Selected features using RFE:
[2 3]

Selected features using SelectFromModel:
[2 3]

Selected features using Random Forest correlations:
[2, 3]

All methods select the same features, indices 2 and 3, which indicates that they agree on which features matter most. In the Iris dataset these indices correspond to petal length and petal width, which are known to be highly informative for iris species classification.
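The integer indices in the output can be mapped back to column names using the dataset's feature_names attribute. A small sketch, with the indices copied from the output above:

```python
from sklearn.datasets import load_iris

feature_names = load_iris().feature_names

selected_indices = [2, 3]  # indices reported by the selectors above
selected_names = [feature_names[i] for i in selected_indices]
print(selected_names)  # ['petal length (cm)', 'petal width (cm)']
```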

How can Feature Selection reduce overfitting?

Building accurate models is essential for predictive performance in machine learning. Overfitting, in which a model learns noise and fluctuations specific to the training set in addition to the underlying patterns in the data, is an inherent risk. Feature selection is an effective countermeasure: by removing irrelevant or redundant inputs, it gives the model fewer opportunities to fit noise and can improve both its efficiency and its ability to generalize.
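One way to check whether selection helps generalization on a given dataset is to compare cross-validated scores with and without it. The sketch below is an illustration only: the use of 5-fold cross-validation, LogisticRegression as the classifier, and SelectKBest with k=2 are assumptions chosen for this example, and on a dataset as small and clean as Iris the difference may be negligible.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Baseline: classifier trained on all four features
all_features = LogisticRegression(max_iter=1000)

# Selection sits inside the pipeline so it is refit on each training fold
top_two = make_pipeline(SelectKBest(k=2), LogisticRegression(max_iter=1000))

scores_all = cross_val_score(all_features, X, y, cv=5)
scores_top = cross_val_score(top_two, X, y, cv=5)

print(f"All 4 features: {scores_all.mean():.3f}")
print(f"Top 2 features: {scores_top.mean():.3f}")
```

Placing the selector inside the pipeline matters: selecting features on the full dataset before cross-validating would let information from the validation folds influence the choice and make the scores look better than they are.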

