Importing Libraries
Python libraries simplify data handling and modeling tasks to a great extent.
Python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
We will generate a dummy dataset for a classification task using sklearn.
Python3
X, y = datasets.make_classification()
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)
print(X_train.shape, X_val.shape)
Output:
(80, 20) (20, 20)
Let’s train a RandomForestClassifier on this dataset without using any hyperparameters.
Python3
model = RandomForestClassifier()
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 75.0
Here the training accuracy is 100% while the validation accuracy is only 75%, a gap of 25 percentage points, which indicates that the model is overfitting the training data. To address this, let's first use the max_depth parameter.
Python3
model = RandomForestClassifier(max_depth=2, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 75.0
By tuning the value of just one hyperparameter, the gap between training and validation accuracy has narrowed from 25 to 20 percentage points. Similarly, let's use n_estimators.
Python3
model = RandomForestClassifier(n_estimators=30, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 85.0
Again, by tuning another hyperparameter, we reduce the overfitting further.
Python3
model = RandomForestClassifier(max_depth=2,
                               n_estimators=30,
                               min_samples_split=3,
                               max_leaf_nodes=5,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 80.0
As shown above, we can also combine multiple parameters to curb overfitting.
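Rather than trying parameter values by hand as above, we can search over combinations automatically. The sketch below uses sklearn's GridSearchCV with an assumed candidate grid (the specific values here are illustrative, not prescribed by the steps above) to pick the combination with the best cross-validated accuracy.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.make_classification(random_state=2022)
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)

# Candidate values for the hyperparameters tuned above
# (illustrative choices, not the only reasonable grid).
param_grid = {
    'max_depth': [2, 3, 5],
    'n_estimators': [30, 50, 100],
    'min_samples_split': [2, 3, 5],
}

# 5-fold cross-validated search over all combinations.
search = GridSearchCV(RandomForestClassifier(random_state=22),
                      param_grid, cv=5)
search.fit(X_train, Y_train)

print('Best parameters :', search.best_params_)
print('Validation Accuracy :', search.score(X_val, Y_val) * 100)
```

Because the search scores each combination on held-out folds rather than the training data, it favors settings that generalize instead of ones that merely memorize the training set.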
How to Solve Overfitting in Random Forest in Python Sklearn?
In this article, we saw how to solve overfitting in Random Forest in sklearn using Python.