Linear and Quadratic Discriminant Analysis using Sklearn
Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are two well-known classification methods in machine learning. They are especially helpful when you have labeled data and want to classify new observations into pre-defined categories.
In this article, we will implement both techniques, Linear and Quadratic Discriminant Analysis, using Sklearn.
Table of Contents
- Understanding Linear and Quadratic Discriminant Analysis
- Implementing Linear and Quadratic Discriminant Analysis with Scikit-Learn
- Applying Linear Discriminant Analysis (LDA)
- Applying Quadratic Discriminant Analysis (QDA)
- Visualizing Linear and Quadratic Discriminant Analysis
Understanding Linear and Quadratic Discriminant Analysis
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis assumes that the data in each class is normally distributed and that all classes share the same covariance matrix. It finds a linear combination of features that best separates the classes, sometimes referred to as Fisher’s linear discriminant. The idea is to maximize the distance between classes while projecting the data into a lower-dimensional space.
Under these assumptions, LDA determines the best linear decision boundary by maximizing the ratio of between-class variance to within-class variance (equivalently, minimizing the ratio of within-class to between-class variance).
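One common way to write this ratio is the Fisher criterion. The symbols below (w, S_B, S_W, mu_k, mu, n_k) are notation assumed here for illustration; they do not appear in the original text.

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_W = \sum_{k} \sum_{i : y_i = k} (x_i - \mu_k)(x_i - \mu_k)^{\top},
\qquad
S_B = \sum_{k} n_k \, (\mu_k - \mu)(\mu_k - \mu)^{\top}
```

Here mu_k is the mean of class k, mu is the overall mean, and n_k is the number of samples in class k. LDA looks for the directions w that make J(w) as large as possible.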
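Conceptually, LDA can be computed in the following steps (Sklearn handles these internally, and its default solver uses a singular value decomposition rather than an explicit eigendecomposition):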
- Compute the mean vectors for each class.
- Compute the within-class and between-class scatter matrices.
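- Compute the eigenvalues and eigenvectors of the matrix built from the scatter matrices (the inverse of the within-class scatter times the between-class scatter).
- Select the top k eigenvectors that correspond to the k largest eigenvalues to form a new feature space.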
- Project the data onto the new feature space.
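The steps above can be sketched directly in NumPy. This is a minimal illustration of the classical eigendecomposition approach, not how Sklearn implements it; the function name `fisher_lda_projection` and the toy data are hypothetical.

```python
import numpy as np

def fisher_lda_projection(X, y, k):
    """Project X onto the top-k discriminant directions (classical approach)."""
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_w = np.zeros((n_features, n_features))  # within-class scatter
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)              # class mean vector
        centered = X_c - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)

    # Eigen-decomposition of pinv(S_w) @ S_b; keep the real parts because
    # tiny imaginary components can appear from floating-point error
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    eigvals, eigvecs = eigvals.real, eigvecs.real

    # Keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]
    return X @ W, W

# Toy data (assumed for illustration): three Gaussian classes in 3 dimensions
rng = np.random.default_rng(0)
means = ([0, 0, 0], [3, 3, 0], [0, 3, 3])
X_demo = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 3)) for m in means])
y_demo = np.repeat([0, 1, 2], 50)

X_proj, W = fisher_lda_projection(X_demo, y_demo, k=2)
print(X_proj.shape)  # (150, 2)
```

With three classes there are at most two useful discriminant directions, which is why k=2 is used here.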
Quadratic Discriminant Analysis (QDA)
QDA is similar to LDA but does not assume that the covariance matrices of the classes are equal. This lets QDA build more flexible decision boundaries by describing each class with its own covariance matrix.
Conceptually, QDA can be computed in the following steps:
- Compute the mean vector and covariance matrix for each class.
- Use the quadratic form of the discriminant function to classify new data.
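As a rough sketch of these two steps, the NumPy code below estimates a mean, covariance matrix, and prior per class, then scores new points with the Gaussian quadratic discriminant. The helper names `fit_qda` and `predict_qda` and the toy data are hypothetical, and the code skips the numerical safeguards a library implementation would include.

```python
import numpy as np

def fit_qda(X, y):
    """Estimate mean, covariance and prior for each class."""
    params = {}
    for c in np.unique(y):
        X_c = X[y == c]
        params[int(c)] = (
            X_c.mean(axis=0),
            np.cov(X_c, rowvar=False),
            X_c.shape[0] / X.shape[0],
        )
    return params

def predict_qda(params, X):
    """Assign each row of X to the class with the largest discriminant score."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mean, cov, prior = params[c]
        diff = X - mean
        inv_cov = np.linalg.inv(cov)
        # Squared Mahalanobis distance of every row to the class mean
        maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

# Toy data (assumed for illustration): two classes with different spreads
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[6, 6], scale=2.0, size=(100, 2)),
])
y_demo = np.repeat([0, 1], 100)

params = fit_qda(X_demo, y_demo)
pred_demo = predict_qda(params, X_demo)
print("Training accuracy:", (pred_demo == y_demo).mean())
```

Because each class keeps its own covariance matrix, the score is quadratic in the input, which is what produces curved decision boundaries.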
Implementing Linear and Quadratic Discriminant Analysis with Scikit-Learn
Scikit-Learn is a well-known Python machine learning package that offers effective implementations of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) via the LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis classes. To use LDA or QDA in Scikit-Learn, follow the steps below.
1. Import the Necessary Modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
2. Generate Data
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Applying Linear Discriminant Analysis (LDA)
# Initialize and train the LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
# Make predictions
y_pred_lda = lda.predict(X_test)
# Evaluate the model
print("LDA Accuracy:", accuracy_score(y_test, y_pred_lda))
print("Confusion Matrix (LDA):\n", confusion_matrix(y_test, y_pred_lda))
print("Classification Report (LDA):\n", classification_report(y_test, y_pred_lda))
Output:
LDA Accuracy: 0.8266666666666667
Confusion Matrix (LDA):
[[ 75   4  22]
 [ 16  71   0]
 [  0  10 102]]
Classification Report (LDA):
              precision    recall  f1-score   support
           0       0.82      0.74      0.78       101
           1       0.84      0.82      0.83        87
           2       0.82      0.91      0.86       112
    accuracy                           0.83       300
   macro avg       0.83      0.82      0.82       300
weighted avg       0.83      0.83      0.83       300
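Besides classifying, LDA can also project data into a lower-dimensional space, as described earlier. The sketch below regenerates the same synthetic data and reduces it from two features to one discriminant axis; the variable names `lda_1d` and `X_1d` are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same synthetic data as in the article
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           n_classes=3, random_state=42)

# n_components can be at most min(n_classes - 1, n_features), which is 2 here
lda_1d = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda_1d.fit(X, y).transform(X)

print(X_1d.shape)  # (1000, 1)
# Share of between-class variance captured by the kept component
print(lda_1d.explained_variance_ratio_)
```

The projected values can then be plotted or passed to another model as a compact feature.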
Applying Quadratic Discriminant Analysis (QDA)
# Initialize and train the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
# Make predictions
y_pred_qda = qda.predict(X_test)
# Evaluate the model
print("QDA Accuracy:", accuracy_score(y_test, y_pred_qda))
print("Confusion Matrix (QDA):\n", confusion_matrix(y_test, y_pred_qda))
print("Classification Report (QDA):\n", classification_report(y_test, y_pred_qda))
Output:
QDA Accuracy: 0.93
Confusion Matrix (QDA):
[[ 96   2   3]
 [ 10  77   0]
 [  4   2 106]]
Classification Report (QDA):
              precision    recall  f1-score   support
           0       0.87      0.95      0.91       101
           1       0.95      0.89      0.92        87
           2       0.97      0.95      0.96       112
    accuracy                           0.93       300
   macro avg       0.93      0.93      0.93       300
weighted avg       0.93      0.93      0.93       300
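One plausible reason QDA scores higher here is that the classes do not share a covariance matrix, which is the assumption LDA relies on. A quick way to check is to print the covariance matrix of each class in the training set. This is a sketch using the same data generation as above; the dictionary name `class_covs` is chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the article
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Empirical covariance matrix of each class in the training set
class_covs = {}
for label in np.unique(y_train):
    class_covs[int(label)] = np.cov(X_train[y_train == label], rowvar=False)
    print(f"Class {label} covariance:\n{class_covs[int(label)]}\n")
```

If the printed matrices differ clearly in scale or orientation, the equal-covariance assumption is doubtful and QDA is worth trying.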
Visualizing Linear and Quadratic Discriminant Analysis
For visualization, let’s plot the decision boundaries. A decision boundary is the curve that separates the regions a classifier assigns to different classes. The goal of a classifier is to predict the class of a new data point based on its features, and the decision boundaries show the rule it uses to split the feature space.
def plot_decision_boundaries(X, y, model, title, subplot_index):
    plt.subplot(subplot_index)
    # Build a grid that covers the data with a margin of 1 on each side
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    # Predict the class at every grid point and reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Shade the predicted regions and overlay the actual points
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o')
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
plt.figure(figsize=(10, 4))
# Plot decision boundaries for LDA
plot_decision_boundaries(X_test, y_test, lda, "LDA Decision Boundary", 121)
# Plot decision boundaries for QDA
plot_decision_boundaries(X_test, y_test, qda, "QDA Decision Boundary", 122)
plt.tight_layout()
plt.show()
Output:
In each panel, the shaded regions show the class the model predicts at every point of the feature space, and the dots are the test points colored by their true class.
LDA projects data from a higher-dimensional space onto a lower-dimensional space in a way that maximizes the separation between different classes. Here, its decision boundaries are straight lines that split the plane into three class regions, while QDA allows curved (quadratic) boundaries. The QDA decision boundary is therefore more flexible than the LDA decision boundary, which helps it fit this dataset better (0.93 accuracy versus about 0.83 for LDA).
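The linearity of the LDA boundary can be checked numerically: the per-class scores are a linear function of the features, built from the fitted `coef_` and `intercept_` attributes. The sketch below regenerates the same synthetic data and compares a manual computation with `decision_function`; the variable names `manual_scores` and `manual_pred` are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same synthetic data as in the article
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           n_classes=3, random_state=42)
lda = LinearDiscriminantAnalysis().fit(X, y)

# One linear score per class: X @ coef_.T + intercept_
manual_scores = X @ lda.coef_.T + lda.intercept_
manual_pred = lda.classes_[np.argmax(manual_scores, axis=1)]

print("Scores match:", np.allclose(manual_scores, lda.decision_function(X)))
print("Predictions match:", np.array_equal(manual_pred, lda.predict(X)))
```

Since every class score is linear in the features, the points where two scores are equal form a straight line, which is why the LDA regions have straight edges.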
Conclusion
In conclusion, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are effective methods for supervised classification problems. LDA assumes that all classes share the same covariance matrix, while QDA relaxes this condition by allowing each class to have its own covariance matrix. Both approaches are practical and have their merits; Scikit-Learn offers handy implementations that make integrating them into machine learning pipelines simple.
Linear and Quadratic Discriminant Analysis using Sklearn: FAQs
When is it better to employ LDA than QDA?
If you want a simpler model and the classes have comparable covariance matrices, use LDA. When the decision boundary is non-linear or the classes have distinct covariance matrices, use QDA.
Can high-dimensional data be handled by LDA and QDA?
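Yes, both QDA and LDA can be applied to high-dimensional data; however, if the number of features is large relative to the number of samples, the covariance estimates become unreliable and overfitting may occur. QDA is more exposed to this problem because it estimates a separate covariance matrix for each class. Regularization helps, for example the shrinkage option of LinearDiscriminantAnalysis or the reg_param option of QuadraticDiscriminantAnalysis.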
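As a small illustration of the high-dimensional case, the sketch below compares LDA with and without covariance shrinkage on synthetic data that has more features than training samples. The dataset sizes and variable names are assumptions chosen for the example, and the scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed setup: 100 features but only 60 training samples
X_hd, y_hd = make_classification(n_samples=120, n_features=100,
                                 n_informative=10, n_redundant=0,
                                 n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_hd, y_hd, test_size=0.5, random_state=0)

# Shrinkage requires the 'lsqr' or 'eigen' solver
lda_plain = LinearDiscriminantAnalysis(solver="lsqr").fit(X_tr, y_tr)
lda_shrunk = LinearDiscriminantAnalysis(
    solver="lsqr", shrinkage="auto").fit(X_tr, y_tr)

score_plain = lda_plain.score(X_te, y_te)
score_shrunk = lda_shrunk.score(X_te, y_te)
print("LDA without shrinkage:", score_plain)
print("LDA with shrinkage:   ", score_shrunk)
```

Setting shrinkage="auto" lets Sklearn choose the shrinkage intensity automatically (using the Ledoit-Wolf estimate), which is usually a reasonable starting point when samples are scarce.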