PowerTransformer in scikit-learn

When it comes to data preprocessing, many machine learning algorithms perform better when input variables follow an approximately Gaussian distribution. PowerTransformer is a preprocessing class in scikit-learn that applies a power transform to make data more Gaussian-like. This article explores the PowerTransformer technique and its methods, along with their implementation in scikit-learn.

Table of Contents

  • What is a PowerTransformer?
  • How Does PowerTransformer Work?
    • Box-Cox Transform
    • Yeo-Johnson Transform
  • Implementation: PowerTransformer in Scikit-Learn
    • Step 1: Import Libraries
    • Step 2: Generating Skewed Data
    • Step 3: Applying PowerTransformer
  • Advantages of PowerTransformer

What is a PowerTransformer?

The PowerTransformer is a technique that makes numerical data resemble a Gaussian distribution more closely, something many machine learning models assume about their inputs. It is especially valuable when data shows significant skewness or kurtosis. By stabilizing variance and reducing skewness, the PowerTransformer helps the data satisfy these statistical assumptions, which can improve model performance.

How Does PowerTransformer Work?

The ‘PowerTransformer’ supports two main transformations:

  1. Box-Cox Transform
  2. Yeo-Johnson Transform

Both of these methods estimate an optimal transformation parameter, lambda (λ), that normalizes the data; scikit-learn estimates λ through maximum likelihood.

Box-Cox Transform

The Box-Cox transformation is a statistical method used to stabilize variance and make data more closely meet the assumptions of normality. It can only be applied to strictly positive data. The transformation is parameterized by a value λ (lambda), which is varied to find the best approximation of a normal distribution.

The formula for the Box-Cox transformation is:

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\[4pt]
\ln(y), & \text{if } \lambda = 0
\end{cases}
\]

This transformation helps improve the validity of many statistical techniques that assume normality.
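
As a quick illustration, here is a minimal sketch of Box-Cox applied through PowerTransformer; the lognormal sample data is assumed purely for demonstration, since Box-Cox requires strictly positive input:

Python
from sklearn.preprocessing import PowerTransformer
import numpy as np

# Box-Cox requires strictly positive input, so lognormal samples are used
rng = np.random.default_rng(42)
positive_data = rng.lognormal(size=(1000, 1))

pt_boxcox = PowerTransformer(method='box-cox', standardize=False)
transformed = pt_boxcox.fit_transform(positive_data)

# lambdas_ holds the lambda estimated by maximum likelihood for each feature
print(pt_boxcox.lambdas_)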

Yeo-Johnson Transform

The Yeo-Johnson transformation, an extension of the Box-Cox method, serves to stabilize variance and normalize data distributions, rendering it more adaptable for real-world scenarios by accommodating both positive and negative data values.

The transformation is defined as follows for values of λ and y:

\[
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0,\ y \geq 0 \\[4pt]
\ln(y + 1), & \text{if } \lambda = 0,\ y \geq 0 \\[4pt]
-\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \text{if } \lambda \neq 2,\ y < 0 \\[4pt]
-\ln(-y + 1), & \text{if } \lambda = 2,\ y < 0
\end{cases}
\]
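
Since Yeo-Johnson accepts zero and negative values, it can be applied where Box-Cox cannot. A minimal sketch, with a small array assumed for illustration:

Python
from sklearn.preprocessing import PowerTransformer
import numpy as np

# Yeo-Johnson handles negative, zero, and positive values alike
mixed_data = np.array([[-3.0], [-1.0], [0.0], [2.0], [5.0]])

pt_yj = PowerTransformer(method='yeo-johnson', standardize=False)
print(pt_yj.fit_transform(mixed_data))
print(pt_yj.lambdas_)  # the estimated lambda for the single feature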

Implementation: PowerTransformer in Scikit-Learn

To use PowerTransformer in scikit-learn, follow these steps:

Step 1: Import Libraries

Here, we import necessary libraries: PowerTransformer from scikit-learn for applying the Yeo-Johnson transformation, numpy for numerical operations, and matplotlib.pyplot for data visualization.

Python
from sklearn.preprocessing import PowerTransformer
import numpy as np
import matplotlib.pyplot as plt


Step 2: Generating Skewed Data

We use np.random.exponential to generate 1000 samples from an exponential distribution. This creates skewed data, as exponential distributions typically have a right-skewed shape. We visualize the original data using a histogram with 30 bins. This provides a visual representation of the skewness in the data.

Python
# Generating skewed data
np.random.seed(0)
data = np.random.exponential(size=1000)

# Visualizing the original data
plt.hist(data, bins=30, alpha=0.7, label='Original Data')
plt.legend()
plt.show()

Output:

[Skewed Dataset: histogram of the original right-skewed data]

Step 3: Applying PowerTransformer

We initialize the PowerTransformer object with the method parameter set to 'yeo-johnson'. This indicates that we want to apply the Yeo-Johnson transformation. The standardize parameter is set to True, which means the transformed data will be standardized (mean=0, variance=1). We reshape the data to have a single feature (as required by scikit-learn), then apply the Yeo-Johnson transformation using the fit_transform method of the PowerTransformer object. Finally, we visualize the transformed data using a histogram with 30 bins. This allows us to observe how the transformation has affected the distribution of the data.

Python
# Initialize the PowerTransformer
pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Transform the data
data_transformed = pt.fit_transform(data.reshape(-1, 1))

# Visualizing the transformed data
plt.hist(data_transformed, bins=30, alpha=0.7, color='green', label='Transformed Data')
plt.legend()
plt.show()

Output:

[Normally Distributed Data: histogram of the transformed, approximately Gaussian data]
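
After fitting, the transformer also exposes the estimated λ and can undo the transformation. A short sketch continuing from the code above:

Python
# Inspect the lambda chosen by maximum likelihood during fit
print(pt.lambdas_)

# Map the transformed values back to the original scale
data_restored = pt.inverse_transform(data_transformed)
print(np.allclose(data_restored.ravel(), data))  # expected: True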


Advantages of PowerTransformer

  • Handling Skewed Data: Many real-world datasets exhibit skewness, where the distribution of values is asymmetric. PowerTransformer can effectively mitigate this skewness, making the data distribution more symmetrical, which can benefit the performance of certain machine learning algorithms (see the sketch after this list).
  • Preservation of Rank Order: Because both the Box-Cox and Yeo-Johnson mappings are monotonic, PowerTransformer preserves the rank order of the data. This is important when the relative ordering of values carries meaningful information, as is often the case in many applications.
  • Robustness to Outliers: PowerTransformer is relatively robust to outliers compared to range-based methods such as min-max scaling, whose output is dominated by extreme values. Since outliers can significantly impact model performance, handling them gracefully is a valuable property.
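
As a quick check of the first two points, the following minimal sketch (reusing the exponential data from the example above, with scipy assumed to be installed) compares skewness before and after the transform and verifies that the rank order of the samples is unchanged:

Python
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer
import numpy as np

np.random.seed(0)
data = np.random.exponential(size=1000)

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data.reshape(-1, 1)).ravel()

# Skewness moves toward 0 after the transform
print(skew(data), skew(transformed))

# Monotonic mapping: sorting positions are identical before and after
print(np.array_equal(np.argsort(data), np.argsort(transformed)))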

Conclusion

The PowerTransformer technique, through the Box-Cox and Yeo-Johnson transformations, normalizes numerical data by estimating an optimal λ for each feature, which helps machine learning models that assume Gaussian-like inputs perform better. Its relative robustness to outliers, preservation of rank order, and effectiveness on skewed data make it a valuable tool in data preprocessing for many machine learning applications.