Importing Libraries
Python libraries simplify data handling and modeling tasks to a great extent.
Python3
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
We will generate a dummy dataset for a classification task using sklearn.
Python3
X, y = datasets.make_classification()
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)
print(X_train.shape, X_val.shape)
Output:
(80, 20) (20, 20)
Let’s train a RandomForestClassifier on this dataset without using any hyperparameters.
Python3
model = RandomForestClassifier()
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 75.0
Here the training accuracy is 100% while the validation accuracy is only 75%, a gap of 25 percentage points, which indicates that the model is overfitting the training data. To address this, let's first use the max_depth parameter.
Python3
model = RandomForestClassifier(max_depth=2, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 75.0
By tuning the value of just one hyperparameter, the gap between training and validation accuracy has narrowed from 25 to 20 percentage points. Similarly, let's use n_estimators.
Python3
model = RandomForestClassifier(n_estimators=30, random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 100.0
Validation Accuracy : 85.0
Again, by tuning another hyperparameter, we reduce the overfitting further.
Python3
model = RandomForestClassifier(max_depth=2,
                               n_estimators=30,
                               min_samples_split=3,
                               max_leaf_nodes=5,
                               random_state=22)
model.fit(X_train, Y_train)
print('Training Accuracy : ',
      metrics.accuracy_score(Y_train, model.predict(X_train)) * 100)
print('Validation Accuracy : ',
      metrics.accuracy_score(Y_val, model.predict(X_val)) * 100)
Output:
Training Accuracy : 95.0
Validation Accuracy : 80.0
As shown above, we can also combine multiple parameters to curb overfitting.
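Rather than trying parameter values by hand as above, we can search over combinations automatically. The sketch below uses sklearn's GridSearchCV with an assumed candidate grid (the specific values here are illustrative, not prescribed by the steps above) to pick the combination with the best cross-validated accuracy.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.make_classification(random_state=2022)
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.2, random_state=2022)

# Candidate values for the hyperparameters tuned above
# (illustrative choices, not the only reasonable grid).
param_grid = {
    'max_depth': [2, 3, 5],
    'n_estimators': [30, 50, 100],
    'min_samples_split': [2, 3, 5],
}

# 5-fold cross-validated search over all combinations.
search = GridSearchCV(RandomForestClassifier(random_state=22),
                      param_grid, cv=5)
search.fit(X_train, Y_train)

print('Best parameters :', search.best_params_)
print('Validation Accuracy :', search.score(X_val, Y_val) * 100)
```

Because the search scores each combination on held-out folds rather than the training data, it favors settings that generalize instead of ones that merely memorize the training set.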
How to Solve Overfitting in Random Forest in Python Sklearn?
In this article, we saw how to solve overfitting in Random Forest in sklearn using Python.