How to prevent overfitting in random forests with Python's scikit-learn?
Hyperparameter tuning is the usual answer whenever we want to improve a model's performance without changing the available dataset. But before exploring which hyperparameters can help, let's understand how the random forest model works.
A random forest is an ensemble of decision trees: each tree is trained on a random bootstrap sample of the data, and combining their predictions (by majority vote for classification, averaging for regression) reduces variance and usually improves accuracy over a single tree. Based on this, there are several hyperparameters we can tune when creating a random forest instance that help curb overfitting.
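As a minimal sketch of the idea above, the snippet below fits a forest on a toy dataset (the dataset and parameter values are illustrative choices, not recommendations) and shows that the model is literally a collection of fitted decision trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A small synthetic classification dataset, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the data;
# the forest predicts by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# The fitted individual trees are exposed via estimators_.
print(len(forest.estimators_))
```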
- max_depth: This controls the maximum depth, i.e. the number of levels, each decision tree is allowed to grow to. Shallower trees memorize less of the training data.
- n_estimators: This controls the total number of decision trees in the forest. Averaging over more trees reduces variance, so this parameter, together with the previous one, goes a long way toward reducing overfitting.
- criterion: While a tree is being trained, the data is repeatedly split into parts; this parameter sets the measure of split quality (e.g. "gini" or "entropy" for classification) used to choose those splits.
- min_samples_leaf: This sets the minimum number of samples that must end up in each leaf node.
- min_samples_split: This sets the minimum number of samples required to split an internal node.
- max_leaf_nodes: This caps the maximum number of leaf nodes per tree.
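The parameters listed above are all constructor arguments of sklearn's RandomForestClassifier. A sketch of passing them explicitly (the specific values here are illustrative starting points, not tuned results):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,      # total number of trees in the forest
    max_depth=6,           # cap tree depth to limit memorization
    criterion="gini",      # split-quality measure ("entropy" is the alternative)
    min_samples_leaf=5,    # every leaf must cover at least 5 samples
    min_samples_split=10,  # a node needs 10+ samples before it may split
    max_leaf_nodes=50,     # hard cap on the number of leaves per tree
    random_state=42,
)

# get_params() shows every hyperparameter the instance will train with.
print(forest.get_params()["max_depth"])
```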
There are more parameters we can tune to fight overfitting, but the ones mentioned above are the most effective in practice.
Note:-
A random forest model can also be created without setting any of these hyperparameters, because each one has a default value; we set them explicitly when we need tighter control over the model's behavior.
Now let us explore these hyperparameters a bit using a dataset.
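One way to see the effect, sketched below on a synthetic noisy dataset (the dataset and the constrained parameter values are assumptions for illustration): an unconstrained forest grows its trees until the leaves are pure and shows a large train-test accuracy gap, while constraining max_depth and min_samples_leaf shrinks that gap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise, which deep trees happily memorize.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained forest: trees grow until their leaves are pure.
deep = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained forest: shallower trees with larger leaves.
shallow = RandomForestClassifier(max_depth=5, min_samples_leaf=10,
                                 random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(f"{name}: train-test accuracy gap = {gap:.3f}")
```

A smaller train-test gap for the constrained forest is the signature of reduced overfitting; the exact accuracies depend on the data and random seed.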