Bag-Of-Words Model In R
In the following example, we use a spam email dataset and a bag-of-words representation for classification. We train an SVM classifier to distinguish spam from ham (legitimate) messages.
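Before walking through the pipeline, a minimal sketch in base R (using two made-up documents) illustrates what a bag-of-words representation is: each document becomes a vector of word counts over a shared vocabulary, and word order is discarded.

```r
# Two hypothetical documents for illustration
docs <- c("free prize call now", "call me at home")

# Tokenize each document by splitting on spaces
tokens <- strsplit(docs, " ")

# The vocabulary is every unique word across the corpus
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
bow <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(bow) <- c("doc1", "doc2")
bow
```

Each row of `bow` is the count vector for one document; "call" appears in both rows, while "free" appears only in the first.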
Step 1: Load all required libraries
R
library(data.table) # Dataframe library
library(stringr)    # For string methods
library(caret)      # For confusion matrix
library(tm)         # For text mining
library(slam)       # For preprocessing
library(e1071)      # For SVM classifier
Step 2: Load the dataset and preprocess it, similarly to the previous example
R
data <- read.csv("/kaggle/input/spam-email/spam.csv")
corpus <- Corpus(VectorSource(data$Message))

# Numerically encode the labels used for ham and spam: ham -> 0, spam -> 1
labels <- as.numeric(factor(data$Category, levels = c("ham", "spam")))
labels <- labels - 1

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)

# Convert the matrix to a data.table and attach the encoded labels
dtm_df <- as.data.table(matrix)
dtm_df$label <- labels
head(dtm_df)
Output:
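The label-encoding step above can be checked in isolation. This small sketch, using made-up categories, shows how the factor-to-numeric conversion maps "ham" to 0 and "spam" to 1:

```r
# Hypothetical categories to illustrate the encoding used above
category <- c("ham", "spam", "ham", "spam")

# factor() with levels c("ham", "spam") numbers them 1 and 2;
# subtracting 1 yields the binary labels 0 and 1
labels <- as.numeric(factor(category, levels = c("ham", "spam"))) - 1
labels
```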
Step 3: Perform the train-test split in the ratio of 80% and 20% for the train and test sets respectively.
R
# Set the random seed for reproducibility
set.seed(42)

# Split the dataset into train and test sets in an 80-20 ratio
split_index <- sample(1:nrow(dtm_df), 0.8 * nrow(dtm_df))
train_set <- dtm_df[split_index]
test_set <- dtm_df[-split_index]
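The index-based split can be seen more clearly on a toy example: sampling 80% of the row indices selects the training rows, and negative indexing takes the remainder as the test rows.

```r
# Toy illustration of the same split on 10 hypothetical rows
set.seed(42)
n <- 10
split_index <- sample(1:n, 0.8 * n)

# Positive indexing keeps the sampled rows, negative indexing drops them
train_rows <- (1:n)[split_index]
test_rows  <- (1:n)[-split_index]
length(train_rows)
length(test_rows)
```

The two index sets are disjoint and together cover every row exactly once.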
Step 4: Train the model, create predictions on the test set, and build the confusion matrix. Because the label column is numeric, svm() fits a regression model, so its continuous predictions must be thresholded to obtain binary classes.
R
# Train the model using the training dataset
model <- svm(label ~ ., data = train_set, kernel = "linear")

# Make predictions on the test set
predictions <- predict(model, newdata = test_set[, -"label"])

# Evaluate the model: threshold the continuous SVM predictions at 0.5
# to convert them into binary class labels
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)

# Build and print the confusion matrix
confusion_matrix <- table(binary_predictions, test_set$label)
print(confusion_matrix)

# Compute the accuracy of the model
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
Output:
binary_predictions 0 1
0 958 23
1 1 133
[1] "Accuracy: 0.97847533632287"
Hence, the model has a high accuracy of approximately 97.85%, indicating that it correctly predicts the class for a large proportion of instances.
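The printed confusion matrix also allows a quick cross-check of the accuracy, along with precision and recall for the spam class, using only base R:

```r
# Rebuild the confusion matrix printed above (rows: predicted, cols: actual)
cm <- matrix(c(958, 1, 23, 133), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Accuracy: correct predictions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Precision: true spam among all messages predicted as spam
precision <- cm["1", "1"] / sum(cm["1", ])

# Recall: true spam among all messages that are actually spam
recall <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

The recall (about 0.85) shows that while overall accuracy is high, some spam messages are still missed, which accuracy alone does not reveal.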