Importing Libraries

Python libraries greatly simplify data handling and the operations we need for this task.

Python3

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics


We will generate a dummy dataset for a classification task using sklearn's make_classification function.

Python3

# make_classification() generates 100 samples with 20 features by default
X, y = datasets.make_classification()
X_train, X_val, Y_train, Y_val = train_test_split(X, y,
                                                  test_size=0.2,
                                                  random_state=2022)
print(X_train.shape, X_val.shape)


Output:

(80, 20) (20, 20)
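Note that make_classification() is called above without a random_state, so the generated data differs on every run (only the shapes stay the same). A reproducible variant might look like this sketch, where the seed value is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Passing random_state makes the generated dataset reproducible;
# make_classification() defaults to 100 samples with 20 features
X, y = make_classification(random_state=2022)
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)

print(X_train.shape, X_val.shape)  # (80, 20) (20, 20)
```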

Let’s train a RandomForestClassifier on this dataset with its default hyperparameters.

Python3

model = RandomForestClassifier()
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train,
                             model.predict(X_train))*100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val,
                             model.predict(X_val))*100)


Output:

Training Accuracy :  100.0
Validation Accuracy :  75.0

Here we can see that the training accuracy is 100% but the validation accuracy is only 75%. This large gap between the two means the model is overfitting the training data. To address this, let’s first use the max_depth parameter.

Python3

model = RandomForestClassifier(max_depth=2,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train,
                             model.predict(X_train))*100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val,
                             model.predict(X_val))*100)


Output:

Training Accuracy :  95.0
Validation Accuracy :  75.0
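To see how max_depth trades training fit against generalization, we can sweep a few depth values and compare the train/validation gap. This is an illustrative sketch, not part of the original walkthrough; the depth values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=2022)
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)

# Deeper trees fit the training data more closely; max_depth=None
# lets each tree grow until its leaves are pure
for depth in [1, 2, 4, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=22)
    model.fit(X_train, Y_train)
    train_acc = accuracy_score(Y_train, model.predict(X_train))
    val_acc = accuracy_score(Y_val, model.predict(X_val))
    print(f'max_depth={depth}: train={train_acc:.2f}, val={val_acc:.2f}')
```

A widening gap between the two accuracies as depth grows is the signature of overfitting.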

By tuning the value of just one hyperparameter, we have reduced the gap between training and validation accuracy from 25% to 20%. Similarly, let’s use n_estimators.

Python3

model = RandomForestClassifier(n_estimators=30,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train,
                             model.predict(X_train))*100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val,
                             model.predict(X_val))*100)


Output:

Training Accuracy :  100.0
Validation Accuracy :  85.0

Again, by tuning another hyperparameter, we have reduced overfitting even further. Finally, let’s combine several of these hyperparameters.

Python3

model = RandomForestClassifier(max_depth=2,
                               n_estimators=30,
                               min_samples_split=3,
                               max_leaf_nodes=5,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train,
                             model.predict(X_train))*100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val,
                             model.predict(X_val))*100)


Output:

Training Accuracy :  95.0
Validation Accuracy :  80.0

As shown above, we can combine multiple hyperparameters to rein in overfitting.
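Rather than adjusting each hyperparameter by hand, sklearn's GridSearchCV can search over combinations automatically using cross-validation. The sketch below is not part of the original walkthrough, and the grid values are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(random_state=2022)
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)

# Illustrative grid; in practice choose ranges based on your data
param_grid = {
    'max_depth': [2, 4, None],
    'n_estimators': [30, 100],
    'min_samples_split': [2, 3],
}
search = GridSearchCV(RandomForestClassifier(random_state=22),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, Y_train)

print(search.best_params_)
print('Validation Accuracy : ', search.score(X_val, Y_val) * 100)
```

Because each combination is scored on held-out folds of the training data, the selected model is less likely to be one that merely memorized the training set.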

How to Solve Overfitting in Random Forest in Python Sklearn?

In this article, we are going to see how to solve overfitting in Random Forest in Sklearn using Python.


What is overfitting?

Overfitting is a common phenomenon to watch for whenever you train a machine learning model. It happens when a model learns not only the underlying pattern but also the noise in the data it is trained on: the model picks up on quirks that are specific to the training observations and do not generalize. As a result, the model makes great predictions on the data it was trained on but poor predictions on data it did not see during training.

Why is overfitting a problem?

Overfitting is a problem because machine learning models are generally trained with the intention of making predictions on unseen data. A model that overfits its training set cannot make good predictions on new data it did not see during training, which defeats that purpose.

How do you check whether your model is overfitting to the training data?

To check whether your model is overfitting, split your data into a training set used to fit the model and a test set that is not touched at all during training. You will then have data the model never saw, which you can use to assess whether it generalizes.

How to prevent overfitting in random forests in Python Sklearn?

Hyperparameter tuning is the answer whenever we want to boost a model's performance without any change to the available dataset. But before exploring which hyperparameters can help us, let's understand how the random forest model works.

Conclusion

Overfitting in a Random Forest can be reduced through hyperparameter tuning: limiting tree growth with max_depth and max_leaf_nodes, requiring more samples per split with min_samples_split, and adjusting the number of trees with n_estimators all helped narrow the gap between training and validation accuracy.