Handling Missing Values with Random Forest
Missing values are a common obstacle in machine learning because most statistical models cannot use incomplete rows directly. Random Forest, an ensemble learning method that handles both classification and regression, offers a flexible way to fill those gaps: a model trained on the complete rows predicts the missing entries, which often captures relationships that simple mean or median imputation misses. This approach is widely used in fields such as healthcare, where incomplete records are common. In this article, we will see how to handle missing values explicitly using Random Forest.
What are structurally missing data?
Structurally missing data is data that is missing because it logically cannot exist for a given record, not because of an error or a random process. For example, a field that applies only to some records is necessarily empty for all the others.
Handling Structurally Missing Data:
- Recoding and Filtering: Address structurally missing data by recoding or filtering out instances.
- Modeling Considerations: Include variables with structurally missing data as interaction terms, without a main effect.
- Population Considerations: Recognize that records with structurally missing data may represent a different population, which informs the decision to drop or keep them.
Understanding and handling structurally missing data is crucial for accurate analysis and modeling, allowing researchers to make informed decisions without bias or inaccuracies.
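As a minimal sketch of the filtering and recoding options above, assume a hypothetical survey table in which spouse_age is only defined for married respondents. The column names and values are invented for illustration and do not come from the Titanic data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 'spouse_age' cannot exist for unmarried respondents.
survey = pd.DataFrame({
    "married": [True, False, True, False],
    "spouse_age": [34.0, np.nan, 51.0, np.nan],
})

# Filtering: analyse spouse_age only within the population where it is defined.
married_only = survey[survey["married"]]

# Recoding: keep all rows and add an explicit flag marking where the value applies.
recoded = survey.assign(spouse_age_applicable=survey["spouse_age"].notna())

print(married_only["spouse_age"].mean())
print(recoded)
```

Neither option invents a value for the unmarried rows, which is the point: imputing a number there would describe something that cannot exist.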
Data that is missing for other reasons is usually classified by the mechanism behind the missingness:
- MCAR (Missing Completely At Random): The probability that a value is missing is the same for every observation and unrelated to any data, observed or not. This reduces the analyzable population and statistical power but does not introduce bias.
- MAR (Missing At Random): The probability that a value is missing depends on observed data but not on the missing value itself. Methods such as Multiple Imputation and Maximum Likelihood are needed to handle it accurately.
- NMAR (Not Missing At Random): The probability that a value is missing depends on the unobserved value itself. This is the hardest case: standard imputation techniques can give biased results, and specialized methods are required.
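The three mechanisms can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables income and age, the missingness probabilities, and the thresholds are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 12, size=n)                       # always observed
income = 30 + 0.5 * age + rng.normal(0, 8, size=n)     # value that will go missing

# MCAR: every value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable 'age'.
mar_mask = rng.random(n) < np.where(age > 40, 0.35, 0.05)

# NMAR: missingness depends on the unobserved value itself.
nmar_mask = rng.random(n) < np.where(income > 50, 0.35, 0.05)

print(f"true mean income: {income.mean():.2f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("NMAR", nmar_mask)]:
    print(f"{name}: mean of observed income = {income[~mask].mean():.2f}")
```

In this setup the observed mean stays close to the true mean under MCAR and drifts away from it under NMAR. Under MAR the naive mean also drifts, but because the cause (age) is observed, a model that conditions on age can in principle correct for it.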
Imputation Techniques for Handling Missing Values with Random Forest
- Random Forest Imputation: Uses Random Forest to handle missing data, with techniques such as proximity imputation and on-the-fly imputation for complex datasets. It requires careful parameter tuning but can capture complex, non-linear relationships in the data.
- Miss Forest: An iterative imputation algorithm built on Random Forest. It handles mixed (numeric and categorical) data types with little pre-processing and is robust thanks to the feature selection built into tree models. It has been reported to outperform KNN-Impute and to work well for imputing missing laboratory data in medical predictive models.
- MICE Forest: Integrates Random Forest models into MICE (Multiple Imputation by Chained Equations). It starts with a preliminary imputation and then refines it with Random Forests, and has been reported to give efficient hazard ratio estimates, which makes it suitable for complex datasets with missing data.
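The iterative idea behind these methods can be sketched with scikit-learn's IterativeImputer using a RandomForestRegressor as its estimator. Note that this is a stand-in in the same spirit, not the missForest or miceforest packages themselves, and the synthetic data and parameter values below are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Make column 2 predictable from columns 0 and 1.
X[:, 2] = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Hide 40 values of column 2.
X_missing = X.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 2] = np.nan

# Start from a simple fill, then repeatedly re-predict each incomplete column.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

error = np.abs(X_imputed[missing_rows, 2] - X[missing_rows, 2]).mean()
print(f"mean absolute imputation error: {error:.3f}")
```

Because the true values are known here, the error can be measured directly. On real data that check is only possible by hiding values that were actually observed.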
Handling Missing Values with Random Forest using Python
In this section, we will walk through the process of handling missing values in a dataset using Random Forest as a predictive model. Specifically, we’ll focus on predicting missing ‘Age’ values in the Titanic dataset, which is a classic dataset used in machine learning and data analysis.
Step 1: Importing Necessary Libraries
# Import Libraries
import pandas as pd
import numpy as np
Step 2: Loading Datasets
Here, we use the Titanic dataset, read from a local file named Data.csv.
# Importing dataset and setting 'PassengerId' as index
Data = pd.read_csv('Data.csv', index_col='PassengerId')
Data.head()
Output:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Step 3: Data Preprocessing
Handling Missing Values:
The code Data.isnull().sum() is used to check for missing values in a DataFrame called Data.
# Missing Values
Data.isnull().sum()
Output:
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
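Raw counts are easier to judge as percentages. With the 891 rows in this dataset, ‘Cabin’ is missing in about 77% of rows (687 of 891) and ‘Age’ in about 20% (177 of 891). A minimal sketch of that calculation on a made-up toy frame (the values below are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column that is mostly empty carries little information to impute from, which is one reason to drop it rather than fill it.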
- This part of the code removes the ‘Cabin’ column from the DataFrame Data. The columns=['Cabin'] argument specifies that we want to drop the ‘Cabin’ column; because columns already identifies the axis, a separate axis=1 argument is not needed.
# Dropping 'Cabin' column because most of its values are missing
Data = Data.drop(columns=['Cabin'])
Data.head()
Output:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S
- The code Data.Embarked.value_counts() is used to count the number of occurrences of each unique value in the ‘Embarked’ column of the DataFrame Data.
Data.Embarked.value_counts()
Output:
S 644
C 168
Q 77
Name: Embarked, dtype: int64
- The code finds the most frequent category in the ‘Embarked’ column using value_counts(), which sorts the categories by descending count. The index[0] part retrieves the first (i.e., most frequent) category from the resulting Series, and fillna() replaces the two null values with it.
# 'S' is the most frequent category, so replace the null values with it (the mode)
Data['Embarked'] = Data['Embarked'].fillna(Data['Embarked'].value_counts().index[0])
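The same most-frequent-value fill can be written with Series.mode(). The sketch below uses a made-up Series rather than the real column.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the 'Embarked' column.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores NaN by default; [0] picks the first mode.
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```

When there is no tie for the top count, mode()[0] and value_counts().index[0] give the same value. With a tie they may differ, so it is worth checking the counts first.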
- DataWithAge = Data[Data['Age'].notna()]: This line creates a new DataFrame DataWithAge that contains only the rows where the ‘Age’ column is not null.
- DataWithoutAge = Data[Data['Age'].isna()]: This line creates a new DataFrame DataWithoutAge that contains only the rows where the ‘Age’ column is null.
- print(DataWithAge.shape, DataWithoutAge.shape): This line prints the shape of the two DataFrames. The shape attribute of a DataFrame returns a tuple with its dimensions (number of rows, number of columns).
# Splitting data into sets with and without missing 'Age' values
DataWithAge = Data[Data['Age'].notna()]
DataWithoutAge = Data[Data['Age'].isna()]
print(DataWithAge.shape, DataWithoutAge.shape)
Output:
(714, 10) (177, 10)
Features is a list containing the names of the selected features. These features are:
- ‘Survived’: Whether the passenger survived or not (1 = Yes, 0 = No)
- ‘Pclass’: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- ‘Age’: Age of the passenger
- ‘SibSp’: Number of siblings/spouses aboard
- ‘Parch’: Number of parents/children aboard
- ‘Fare’: Passenger fare
# Since the focus is on filling missing values, select only the features that are relevant.
Features = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
One-Hot Encoding
- One-hot encodes categorical variables (‘Embarked’ and ‘Sex’) in the DataWithAge and DataWithoutAge DataFrames, creating new binary columns for each category.
- Selects a subset of features (Features) from both DataFrames, including ‘Survived’, ‘Pclass’, ‘Age’, ‘SibSp’, ‘Parch’, and ‘Fare’.
- Concatenates the selected features and one-hot encoded columns to create the training set (TrainSet) and test set (TestSet) for further analysis.
# Additionally, categorical variables must be encoded as numeric values. This task can be done using one-hot encoding
one_hot_embarked = pd.get_dummies(DataWithAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithAge['Sex'], drop_first=True)
DataWithAge = DataWithAge[Features]
TrainSet = pd.concat([DataWithAge, one_hot_sex, one_hot_embarked], axis=1)
one_hot_embarked = pd.get_dummies(DataWithoutAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithoutAge['Sex'], drop_first=True)
DataWithoutAge = DataWithoutAge[Features]
TestSet = pd.concat([DataWithoutAge, one_hot_sex, one_hot_embarked], axis=1)
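Encoding the two subsets separately works here because every ‘Embarked’ and ‘Sex’ category appears in both. If a category were absent from one subset, the dummy columns would differ between training and test data. A hedged alternative, sketched on a made-up toy frame, is to encode once before splitting:

```python
import numpy as np
import pandas as pd

# Toy frame: the rows without 'Age' happen to contain only the category 'S'.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "S", "C", "S", "Q"],
})

# Encode once on the full frame so both subsets share the same dummy columns.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

with_age = encoded[encoded["Age"].notna()]
without_age = encoded[encoded["Age"].isna()]

print(list(with_age.columns))
print(list(without_age.columns))
```

Had the rows without ‘Age’ been encoded on their own with drop_first=True, their single category would have been dropped and no dummy column would remain, so the model trained on the other subset could not be applied to them.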
Step 4: Model Building
- Importing the Random Forest Regressor: from sklearn.ensemble import RandomForestRegressor imports the RandomForestRegressor class from the sklearn.ensemble module, which is used to train a random forest regression model.
- Creating the Random Forest Regressor: rf_age = RandomForestRegressor() creates an instance of the RandomForestRegressor class, which will be used to train the model.
- Selecting the Predictors: Independent_Features lists the columns used as model inputs. The listing does not define it elsewhere, so it is taken here to be every column of TrainSet except the target ‘Age’.
- Training the Model: rf_age.fit(TrainSet[Independent_Features], TrainSet['Age']) trains the random forest regressor (rf_age) using the features (Independent_Features) as input and the ‘Age’ column from TrainSet as the target variable. The fit method fits the model to the training data, allowing it to learn the relationship between the features and the target variable.
# Now the crucial part: train the Random Forest regressor that will predict the values of the 'Age' column
from sklearn.ensemble import RandomForestRegressor
rf_age = RandomForestRegressor()
# Assumption: the predictors are all columns except the target 'Age'
Independent_Features = [col for col in TrainSet.columns if col != 'Age']
# Training
rf_age.fit(TrainSet[Independent_Features], TrainSet['Age'])
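Before trusting the model to fill real gaps, it can help to check how well it predicts ages that are actually known. The sketch below holds out part of the rows with a known target and compares the forest against a plain median fill. The data is synthetic (not the Titanic file), and the column relationships, split size, and parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the rows where 'Age' is known.
known = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "SibSp": rng.integers(0, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
known["Age"] = 45 - 6 * known["Pclass"] - 3 * known["SibSp"] + rng.normal(0, 3, size=n)

# Hold out 20% of the known rows to play the role of "missing" values.
X_train, X_val, y_train, y_val = train_test_split(
    known.drop(columns="Age"), known["Age"], test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

mae_model = mean_absolute_error(y_val, model.predict(X_val))
mae_baseline = mean_absolute_error(y_val, np.full(len(y_val), y_train.median()))
print(f"random forest MAE: {mae_model:.2f}, median baseline MAE: {mae_baseline:.2f}")
```

If the forest does not clearly beat the median baseline on held-out rows, the extra complexity of model-based imputation is probably not paying off for that column.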
Step 5: Prediction
Predicted_Ages = rf_age.predict(TestSet[Independent_Features]): This line uses the trained random forest regressor (rf_age) to predict the ‘Age’ values in the test set (TestSet). The predict method takes the independent features (Independent_Features) from the test set as input and returns an array of predicted ‘Age’ values.
# Predicting missing 'Age' values in the test set
Predicted_Ages = rf_age.predict(TestSet[Independent_Features])
Predicted_Ages
Output:
array([42.85055556, 35.97916667, 14.9 , 33.98904762, 18.7 ,
27.4787528 , 36.16666667, 19.15 , 22.46666667, 33.444 ,
31.494228 , 41.00333333, 19.15 , 24.48333333, 33.6 ,
41.1 , 11.009 , 27.4787528 , 31.494228 , 19.15 ,
31.494228 , 31.494228 , 27.4787528 , 26.44335664, 18.9 ,
31.494228 , 50.64722222, 16.56666667, 29.35 , 29.97451441,
25.18416667, 10.69333333, 35. , 58.9 , 4.23 ,
...
50.64722222, 13.25 , 49.1 , 38.81666667, 25. ,
34.2 , 34.645 , 26.60555556, 31.494228 , 38.55 ,
10.69333333, 27.325 , 26.60555556, 13.25 , 24.63087302,
27.4787528 , 26.3 ])
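Predictions like these are fractional. If whole-number ages are wanted, note that casting with astype(int) truncates toward zero rather than rounding. A small sketch with a few of the values above, shortened to two decimals:

```python
import numpy as np

predicted = np.array([42.85, 14.9, 18.7, 4.23])

truncated = predicted.astype(int)           # drops the fractional part
rounded = np.round(predicted).astype(int)   # nearest whole number

print(truncated)
print(rounded)
```

Truncation shifts every imputed age down by up to a year, so rounding is the less biased choice when integers are required. Keeping the fractional predictions is also reasonable.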
- Casting Predicted Ages to Integers: TestSet['Age'] = Predicted_Ages.astype(int) casts the predicted ‘Age’ values (Predicted_Ages) to integers using the astype(int) method and assigns them to the ‘Age’ column in the test set (TestSet). The cast truncates the fractional part, so the imputed ages are whole numbers like most ages in the original dataset.
- Concatenating Training and Test Datasets: Titanic_set = pd.concat([TrainSet, TestSet]) concatenates the training set (TrainSet) and the modified test set (TestSet, with missing ‘Age’ values replaced by predicted values) to create a final dataset (Titanic_set) with no missing ‘Age’ values. pd.concat combines the two datasets along the rows; it replaces the older DataFrame.append method, which was removed in pandas 2.0.
# Most ages in the original dataset are whole numbers,
# so cast the predicted values to int and use them to replace the missing ages.
TestSet['Age'] = Predicted_Ages.astype(int)
# Concatenate the training and test datasets to create a final dataset with no missing 'Age' values.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
Titanic_set = pd.concat([TrainSet, TestSet])
# Final Dataset with No Null Values in Age.
Titanic_set.head()
Output:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 True False True
1 1 1 38.0 1 0 71.2833 False False False
2 1 3 26.0 0 0 7.9250 False False True
3 1 1 35.0 1 0 53.1000 False False True
4 0 3 35.0 0 0 8.0500 True False True
- The code Titanic_set.shape returns the dimensions of the DataFrame Titanic_set, which represents the combined dataset containing both the original training data and the test data with missing ‘Age’ values replaced by predicted values.
- The shape attribute of a DataFrame provides information about the number of rows and columns in the DataFrame.
Titanic_set.shape
Output:
(891, 9)
- The code Titanic_set.isnull().sum() is used to check for missing values in the Titanic_set DataFrame after replacing missing ‘Age’ values with predicted values.
# Final check for missing values
Titanic_set.isnull().sum()
Output:
Survived 0
Pclass 0
Age 0
SibSp 0
Parch 0
Fare 0
male 0
Q 0
S 0
dtype: int64
Each number in the output is the count of missing values for the corresponding column. Since all counts are 0, the Titanic_set DataFrame has no missing values in any column after replacing the missing ‘Age’ values with predicted values and one-hot encoding the categorical variables.
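The whole procedure can be condensed into a single helper that writes the predictions back into the original rows instead of splitting and re-concatenating, which keeps the row order and index intact. The function name impute_with_forest and its parameters are hypothetical, introduced here only for illustration, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_forest(df, target, features, random_state=0):
    """Return a copy of df with missing `target` values predicted from `features`."""
    result = df.copy()
    missing = result[target].isna()
    if not missing.any():
        return result
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    # Train on the rows where the target is known.
    model.fit(result.loc[~missing, features], result.loc[~missing, target])
    # Write predictions into the rows where the target was missing.
    result.loc[missing, target] = model.predict(result.loc[missing, features])
    return result


# Made-up toy data with two missing ages.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Age": [38.0, np.nan, 27.0, 22.0, np.nan, 30.0],
})
filled = impute_with_forest(toy, target="Age", features=["Pclass", "Fare"])
print(filled)
```

The features passed in must themselves be complete and numeric, so categorical columns would need to be encoded first, as in Step 3.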