Implementation: PowerTransformer in Scikit-Learn
To use PowerTransformer in scikit-learn, follow these steps:
Step 1: Import Libraries
Here, we import necessary libraries: PowerTransformer from scikit-learn for applying the Yeo-Johnson transformation, numpy for numerical operations, and matplotlib.pyplot for data visualization.
from sklearn.preprocessing import PowerTransformer
import numpy as np
import matplotlib.pyplot as plt
Step 2: Generating Skewed Data
We use np.random.exponential to generate 1000 samples from an exponential distribution. This produces skewed data, because the exponential distribution is right-skewed. We then plot the original data as a histogram with 30 bins, which makes the skewness visible.
# Generating skewed data
np.random.seed(0)
data = np.random.exponential(size=1000)
# Visualizing the original data
plt.hist(data, bins=30, alpha=0.7, label='Original Data')
plt.legend()
plt.show()
Output: [histogram of the original, right-skewed data]
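As a numeric complement to the histogram, the skewness of the generated sample can be estimated directly. The following is a minimal sketch using only numpy; the helper name sample_skewness is hypothetical (it is not part of numpy or scikit-learn), and it computes the third standardized moment. A value well above 0 indicates a right-skewed distribution.

```python
import numpy as np

# Same seed and distribution as in Step 2
np.random.seed(0)
data = np.random.exponential(size=1000)

def sample_skewness(x):
    """Hypothetical helper: third standardized moment (mean of cubed z-scores)."""
    x = np.asarray(x, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

skew_original = sample_skewness(data)
print(f"Skewness of original data: {skew_original:.2f}")
```

For reference, the theoretical skewness of an exponential distribution is 2, so the sample estimate should be clearly positive.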
Step 3: Applying PowerTransformer
We initialize the PowerTransformer object with the method parameter set to 'yeo-johnson', which selects the Yeo-Johnson transformation. The standardize parameter is set to True, so the transformed data is standardized to zero mean and unit variance. Both values are also the defaults. Because scikit-learn expects a 2D array of shape (n_samples, n_features), we reshape the data into a single column with reshape(-1, 1), then fit and apply the Yeo-Johnson transformation in one call using the fit_transform method of the PowerTransformer object. Finally, we plot the transformed data as a histogram with 30 bins to see how the transformation has changed the distribution.
# Initialize the PowerTransformer
pt = PowerTransformer(method='yeo-johnson', standardize=True)
# Transform the data
data_transformed = pt.fit_transform(data.reshape(-1, 1))
# Visualizing the transformed data
plt.hist(data_transformed, bins=30, alpha=0.7, color='green', label='Transformed Data')
plt.legend()
plt.show()
Output: [histogram of the transformed data]
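Beyond the histogram, the effect of the transformation can be checked numerically. The sketch below repeats the steps above and then inspects the fitted transformer: lambdas_ holds the power parameter estimated for each feature, and inverse_transform maps the transformed values back to the original scale. The helper sample_skewness is hypothetical and added only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Recreate the skewed data from Step 2
np.random.seed(0)
data = np.random.exponential(size=1000)

# Fit and apply the Yeo-Johnson transformation as in Step 3
pt = PowerTransformer(method='yeo-johnson', standardize=True)
data_transformed = pt.fit_transform(data.reshape(-1, 1))

def sample_skewness(x):
    """Hypothetical helper: third standardized moment (mean of cubed z-scores)."""
    x = np.asarray(x, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

skew_before = sample_skewness(data)
skew_after = sample_skewness(data_transformed)

# One fitted lambda per feature (here, a single feature)
print("Fitted lambda:", pt.lambdas_[0])
# With standardize=True, mean is ~0 and standard deviation is ~1
print("Mean:", data_transformed.mean())
print("Std:", data_transformed.std())
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)

# Map the transformed values back to the original scale
data_recovered = pt.inverse_transform(data_transformed).ravel()
print("Round trip matches original:", np.allclose(data_recovered, data, atol=1e-6))
```

A skewness much closer to 0 after the transformation suggests a more symmetric, Gaussian-like distribution. Keeping the fitted pt object around is useful because the same transformation can later be applied to new data with pt.transform.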
PowerTransformer in scikit-learn
When it comes to data preprocessing, many machine learning algorithms perform better when variables are transformed to follow a more Gaussian-like distribution. PowerTransformer is a class in scikit-learn's preprocessing module that applies a power transformation to make data more Gaussian-like. This article explores the PowerTransformer technique and its methods, along with its implementation in scikit-learn.
Table of Contents
- What is a PowerTransformer?
- How Does PowerTransformer Work?
- Box-Cox Transform
- Yeo-Johnson Transform
- Implementation: PowerTransformer in Scikit-Learn
- Step 1: Import Libraries
- Step 2: Generating Skewed Data
- Step 3: Applying PowerTransformer
- Advantages of PowerTransformer