Utilizing Quantile Transformer for Outlier Detection in Scikit-learn

Scikit-Learn provides a handy class to take care of data transformation using quantile functions. The details are as follows:

class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=10000, random_state=None, copy=True)

Parameters:

  • n_quantiles (int, default=1000): Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function.
  • output_distribution (str, default='uniform'): The marginal distribution for the transformed data. The choices are 'uniform' and 'normal'.
  • ignore_implicit_zeros (bool, default=False): Only applies to sparse matrices. If True, the quantile statistics are computed by discarding the sparse entries of the matrix. If False, these entries are treated as zeros.
  • subsample (int, default=10000): Maximum number of samples used to estimate the quantiles for computational efficiency.
  • random_state (int, default=None): Determines random number generation for subsampling and smoothing noise.
  • copy (bool, default=True): Set to False to perform an in-place transformation and avoid a copy (if the input is already a NumPy array).

In sklearn, the QuantileTransformer class uses n_quantiles to set the number of landmarks at which the cumulative distribution function (CDF) of each feature is estimated. The transformer discretizes the CDF at these landmarks and maps the values to a uniform or normal distribution. Because the mapping depends on quantile ranks rather than raw magnitudes, outliers have less impact on the transformed distribution.
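As a minimal sketch of this behavior (the toy data below is our own illustration, not part of the walkthrough), we can inject one extreme value into a small sample and confirm that the transformer maps everything, outlier included, into the [0, 1] range:

Python

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X_toy = rng.normal(loc=50, scale=5, size=(100, 1))
X_toy[0] = 500  # inject one extreme outlier

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=0)
X_trans = qt.fit_transform(X_toy)

# every value, including the outlier, now lies in [0, 1];
# the outlier simply lands at the top of the range
print(X_trans.min(), X_trans.max(), X_trans[0])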

Now let's try the quantile transformer on a real-world dataset. We can make use of the Ames housing dataset. The code is as follows:

Python

from sklearn.datasets import fetch_openml

ames_housing = fetch_openml(name="house_prices", parser="auto", as_frame=True)
X = ames_housing.data
X.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     1452 non-null   object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
dtypes: float64(3), int64(34), object(43)
memory usage: 912.6+ KB

Here we make use of the fetch_openml() method to retrieve the Ames housing dataset. Notice that there are a total of 80 columns, which include both numeric and object (categorical) columns.

To make the computation easier, we will keep only the numeric columns that have no missing values.

Python

import numpy as np

# keep only the numeric columns
X = X.select_dtypes(np.number)

# remove the numeric columns that contain missing (NaN) values
X = X.drop(columns=["LotFrontage", "GarageYrBlt", "MasVnrArea"])

Here we selected the numeric columns by passing np.number to the select_dtypes() method, and removed the numeric columns that contain missing values (fewer than 1460 non-null entries) using the drop() method.
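Rather than hard-coding the three column names, one could look them up programmatically before dropping them; a small sketch (assuming X is the frame as it was before the drop() call):

Python

import numpy as np

# list the numeric columns that still contain missing values
num_X = X.select_dtypes(np.number)
print(num_X.columns[num_X.isna().any()].tolist())
# expected: ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']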

Let's fetch the target, the sale price, and convert it to k$.

Python

# fetch the target data
y = ames_housing.target

# convert the price to k$
y = y / 1000
print(y)

Output

0       208.500
1       181.500
2       223.500
3       140.000
4       250.000
         ...
1455    175.000
1456    210.000
1457    266.500
1458    142.125
1459    147.500
Name: SalePrice, Length: 1460, dtype: float64

Next, we can train a RidgeCV regression model with and without a quantile transformation of the target. The code is as follows:

Python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# RidgeCV prediction without target transformation
ridge_cv = RidgeCV()
ridge_cv.fit(X_train, y_train)
y_ridge = ridge_cv.predict(X_test)

# RidgeCV prediction with target transformation
ridge_cv_trans = TransformedTargetRegressor(
    regressor=RidgeCV(),
    transformer=QuantileTransformer(n_quantiles=900, output_distribution="normal"),
)
ridge_cv_trans.fit(X_train, y_train)
y_ridge_trans = ridge_cv_trans.predict(X_test)

In the above code, we split the data into training and test sets using the train_test_split() method. Then we trained the RidgeCV() regression model, with and without a target transformation.

Here, we used the TransformedTargetRegressor() class from sklearn to regress on a transformed target. It applies the given QuantileTransformer() to the target before fitting the RidgeCV regression and inverts the transformation when computing predictions, as the sketch below illustrates.
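For intuition, the following is roughly the manual equivalent of what TransformedTargetRegressor automates, written as a sketch under the same variable names (our illustration, not the library's exact internals):

Python

# transform the target, fit on the transformed target,
# then map the predictions back to the original scale
qt = QuantileTransformer(n_quantiles=900, output_distribution="normal")
y_train_trans = qt.fit_transform(y_train.to_numpy().reshape(-1, 1)).ravel()

ridge = RidgeCV().fit(X_train, y_train_trans)
y_pred = qt.inverse_transform(ridge.predict(X_test).reshape(-1, 1)).ravel()

Let's plot the actual and predicted values for the ridge regressor without and with the target transformation.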

Python

import matplotlib.pyplot as plt
from sklearn.metrics import PredictionErrorDisplay

f, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3))
ax0.set_title("Without Transformer")
ax1.set_title("With Transformer")

# plot the actual vs predicted values
PredictionErrorDisplay.from_predictions(
    y_test,
    y_ridge,
    kind="actual_vs_predicted",
    ax=ax0,
    scatter_kwargs={"alpha": 0.8},
)
PredictionErrorDisplay.from_predictions(
    y_test,
    y_ridge_trans,
    kind="actual_vs_predicted",
    ax=ax1,
    scatter_kwargs={"alpha": 0.8},
)
plt.tight_layout()

Output

[Figure: actual vs. predicted values, without and with the target transformer]

Using the PredictionErrorDisplay class, we plotted the actual vs. predicted values for the transformed and non-transformed targets. With the target transformation, the points follow the diagonal more closely, indicating a better fit, whereas without it the relationship between actual and predicted values is visibly curved.
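PredictionErrorDisplay also supports a residuals view; if you want to inspect the errors directly, a variant of the same plot (our addition, not part of the original walkthrough) is:

Python

# complementary view: residuals vs. predicted values
f, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3))
ax0.set_title("Without Transformer")
ax1.set_title("With Transformer")
PredictionErrorDisplay.from_predictions(
    y_test, y_ridge, kind="residual_vs_predicted", ax=ax0
)
PredictionErrorDisplay.from_predictions(
    y_test, y_ridge_trans, kind="residual_vs_predicted", ax=ax1
)
plt.tight_layout()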

Let's compute the R2 score and the median absolute error for both regression models (with and without the target transformation).

Python

from sklearn.metrics import median_absolute_error, r2_score

print("Ridge Regression without data transformation")
print("R2 Score:", r2_score(y_test, y_ridge))
print("Median Absolute Error:", median_absolute_error(y_test, y_ridge))

print("\nRidge Regression with data transformation")
print("R2 Score:", r2_score(y_test, y_ridge_trans))
print("Median Absolute Error:", median_absolute_error(y_test, y_ridge_trans))

Output

Ridge Regression without data transformation
R2 Score: 0.820797586022058
Median Absolute Error: 16.144323447490024

Ridge Regression with data transformation
R2 Score: 0.8993450921887343
Median Absolute Error: 10.903408258301198

The target transformation raises the R2 score from about 0.82 to 0.90 and cuts the median absolute error from about 16.1 k$ to 10.9 k$.

Quantile Transformer for Outlier Detection

Data transformation applies a mathematical function that maps data onto a common scale, which makes it possible to compare otherwise incomparable columns, e.g., salary in INR with weight in kilograms. Transforming the data also helps satisfy modeling assumptions such as normality, homogeneity, and linearity, via techniques like normalization and standardization. The Quantile Transformer is one such technique for standardizing data.

In this article, we will dig deep into the Quantile Transformer and understand and implement its use for detecting outliers.

Table of Contents

  • Understanding Quantile Transformer
  • Quantile Transformer for Detecting Outliers
  • Quantile Transformation Approaches for Outlier Identification
    • 1. Uniform Distribution
    • 2. Normal Distribution (Gaussian)
  • How Quantile Transformer Works for Outlier Detection?
  • Utilizing Quantile Transformer for Outlier Detection in Scikit-learn
  • Advantages and Disadvantages of Quantile Transformer for Outlier Detection

Understanding Quantile Transformer

The QuantileTransformer in Scikit-Learn is a powerful tool for transforming features in a dataset to follow a specific distribution, such as a Gaussian or Uniform distribution. This transformation is particularly useful in machine learning when the assumption of normality is required for certain models or when the data is highly skewed....
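For example, a heavily right-skewed feature can be mapped to an approximately standard normal one; a quick sketch on synthetic log-normal data (our own illustration):

Python

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed feature

qt = QuantileTransformer(n_quantiles=1000, output_distribution="normal", random_state=0)
gaussian_like = qt.fit_transform(skewed)

# after the transform the feature is approximately standard normal
print(gaussian_like.mean(), gaussian_like.std())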

Quantile Transformer for Detecting Outliers

In the context of outlier detection, the QuantileTransformer can be used to transform the data in a way that makes outliers more visible. By transforming the data to a Uniform distribution, outliers will be mapped to the extremes of the distribution, making them more distinguishable from inliers. It can efficiently reduce the impact of outliers, and therefore it is a robust preprocessing scheme....
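A minimal sketch of this idea (the data, the 2% cut-off, and the variable names are our own illustrative choices): transform to a uniform distribution and flag the points pushed to the extremes of the [0, 1] range.

Python

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(42)
# 97 inliers plus three injected extreme values
data = np.append(rng.normal(0, 1, 97), [15.0, -12.0, 20.0]).reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform", random_state=42)
u = qt.fit_transform(data).ravel()

# flag the extremes of the uniform range as outlier candidates;
# the injected points land here, possibly alongside the most extreme inliers
outlier_mask = (u < 0.02) | (u > 0.98)
print(np.where(outlier_mask)[0])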

Quantile Transformation Approaches for Outlier Identification

The quantile transformer transforms features using quantile information. It is applied to each feature independently. The steps are as follows:...

How Quantile Transformer Works for Outlier Detection?

The quantile transformer first ranks the observations of each feature, estimates the empirical CDF at those ranks, and then maps the CDF values through the quantile function of the target distribution, which may be normal or uniform. The function is applied to each feature independently, and it spreads out the most frequent values, thereby reducing the impact of outliers. Note that it doesn't remove the outliers but shrinks them into a defined range, making them hard to distinguish from inliers....
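A back-of-the-envelope illustration of this pipeline (using scipy's rankdata and norm.ppf to mimic the idea; this is not scikit-learn's exact implementation):

Python

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an extreme outlier

# step 1: ranks, scaled into (0, 1), approximate the empirical CDF
cdf = (stats.rankdata(x) - 0.5) / len(x)

# step 2: map the CDF values through the normal quantile function (ppf)
z = stats.norm.ppf(cdf)
print(z)  # the outlier becomes a modest z-score instead of an extreme value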

Advantages and Disadvantages of Quantile Transformer for Outlier Detection

Let's look at the advantages and limitations of using a quantile transformer for outlier detection....

Conclusion

Most machine learning algorithms perform better when the data follows a uniform or normal distribution. The quantile transformer is a useful tool that automatically transforms a dataset toward a uniform or normal distribution. The entire dataset, outliers included, is mapped into the target range, which makes the outliers indistinguishable from the inliers....