Confusion Matrix in R
In machine learning and statistical classification, the confusion matrix is a fundamental tool for evaluating the performance of a predictive model. It provides a concise summary of a model's classification results, revealing the number of true positives, true negatives, false positives, and false negatives. In the R programming language, creating and interpreting a confusion matrix is straightforward, thanks to the various libraries and functions designed for this purpose.
What is a Confusion Matrix?
A confusion matrix is a tabular representation of the performance of a classification model. It compares the predicted labels of a model with the actual labels from the dataset. The matrix is organized into rows and columns, where each row represents the actual class and each column represents the predicted class. The four essential components of a confusion matrix are:
- True Positive (TP): Instances that are correctly predicted as positive.
- True Negative (TN): Instances that are correctly predicted as negative.
- False Positive (FP): Instances that are incorrectly predicted as positive.
- False Negative (FN): Instances that are incorrectly predicted as negative.
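Before reaching for a package, these four counts can be tallied with base R's table(). A minimal sketch (the label vectors below are illustrative, not from the examples later in this article):

```r
# Build a 2x2 confusion matrix with base R's table() and read off the
# four counts. Listing levels explicitly puts the positive class first.
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(1, 0))
predicted <- factor(c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0), levels = c(1, 0))

cm <- table(Predicted = predicted, Actual = actual)
print(cm)

TP <- cm["1", "1"]  # predicted positive, actually positive
TN <- cm["0", "0"]  # predicted negative, actually negative
FP <- cm["1", "0"]  # predicted positive, actually negative
FN <- cm["0", "1"]  # predicted negative, actually positive
```

Arranging rows as predictions and columns as actual labels matches the layout that caret prints, which makes the two easy to compare.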
Creating a Confusion Matrix in R
R offers several packages for working with confusion matrices, including caret, MLmetrics, and yardstick. Let’s explore how to create and interpret a confusion matrix using the caret package:
Binary Classification
In this example, we'll use a simple binary classification scenario to create and interpret a confusion matrix. Note that the predicted labels here match the actual labels exactly, so every metric in the output comes out perfect.
# Load required libraries
library(caret)
# Generate example data
actual <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
predicted <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
# Create confusion matrix (caret expects the predictions first, then the reference)
conf_matrix <- confusionMatrix(data = predicted, reference = actual)
# Print confusion matrix
print(conf_matrix)
Output:
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 5 0
         1 0 5

               Accuracy : 1
                 95% CI : (0.6915, 1)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : 0.0009766
                  Kappa : 1
 Mcnemar's Test P-Value : NA

            Sensitivity : 1.0
            Specificity : 1.0
         Pos Pred Value : 1.0
         Neg Pred Value : 1.0
             Prevalence : 0.5
         Detection Rate : 0.5
   Detection Prevalence : 0.5
      Balanced Accuracy : 1.0
       'Positive' Class : 0
The output will display the confusion matrix along with various performance metrics such as accuracy, sensitivity (recall), specificity, and precision.
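Rather than reading the metrics off the printed summary, the object returned by confusionMatrix() can be queried directly: it stores the raw counts in $table, the overall metrics in $overall, and the per-class metrics in $byClass. A short sketch reusing the vectors from the example above:

```r
library(caret)

# Same data as the binary example above
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
predicted <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
conf_matrix <- confusionMatrix(data = predicted, reference = actual)

conf_matrix$table                    # the raw counts as a table
conf_matrix$overall["Accuracy"]      # a named entry from the overall metrics
conf_matrix$byClass["Sensitivity"]   # a named entry from the per-class metrics
```

This is handy when a metric feeds into later code (for example, logging accuracy across several models) instead of being read by a person.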
Multi-class Classification
In this example, we’ll work with a multi-class classification scenario using the famous Iris dataset.
# Load required libraries
library(caret)
# Load Iris dataset
data(iris)
# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
# Train a model (e.g., using a decision tree)
model <- train(Species ~ ., data = train_data, method = "rpart")
# Make predictions on test data
predicted <- predict(model, test_data)
# Create confusion matrix (predictions first, actual labels as the reference)
conf_matrix <- confusionMatrix(data = predicted, reference = test_data$Species)
# Print confusion matrix
print(conf_matrix)
Output:
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 8.747e-12
                  Kappa : 0.9
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000
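For multi-class problems, confusionMatrix() also accepts a ready-made contingency table, which makes it easy to experiment with per-class statistics without retraining a model. A sketch with illustrative counts (rows are predictions, columns are actual labels):

```r
library(caret)

# Illustrative 3-class counts; in a real workflow these would come from
# table(predicted, actual) or from a fitted model's predictions.
m <- matrix(c(10,  0, 0,
               0, 10, 2,
               0,  0, 8),
            nrow = 3, byrow = TRUE,
            dimnames = list(Prediction = c("setosa", "versicolor", "virginica"),
                            Reference  = c("setosa", "versicolor", "virginica")))
cm <- confusionMatrix(as.table(m))

cm$overall["Accuracy"]  # overall accuracy (28/30 for these counts)
cm$byClass              # one row of statistics per class
```

In the multi-class case $byClass is a matrix with one row per class, so individual statistics can be pulled out with, e.g., cm$byClass["Class: virginica", "Sensitivity"].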
Once you have created the confusion matrix, interpreting it is crucial for understanding the performance of your model. The diagonal entries count correct predictions, while the off-diagonal entries show where the model confuses one class for another. From these counts you can derive various performance metrics, such as accuracy, precision, recall (sensitivity), specificity, the F1-score, and the area under the ROC curve (AUC).
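As a sketch, the headline metrics can be computed directly from the four counts; the values of TP, TN, FP, and FN below are assumed for illustration:

```r
# Illustrative counts (assumed values, not taken from the examples above)
TP <- 4; TN <- 4; FP <- 1; FN <- 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of all predictions that are correct
precision   <- TP / (TP + FP)                   # share of predicted positives that are real
recall      <- TP / (TP + FN)                   # share of real positives that are found (sensitivity)
specificity <- TN / (TN + FP)                   # share of real negatives that are found
f1          <- 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
```

AUC is the exception: it is computed from predicted probabilities across all thresholds rather than from a single confusion matrix, so it needs a package such as pROC.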
Conclusion
The confusion matrix is a powerful tool for evaluating the performance of classification models in R. By providing a detailed breakdown of prediction outcomes, it enables data scientists and machine learning practitioners to assess the strengths and weaknesses of their models effectively. With the help of R packages like caret, creating and interpreting confusion matrices becomes an integral part of the model evaluation process, contributing to more informed decision-making and model refinement.