Bag-Of-Words Model In R

In the following example, we use a spam email dataset to demonstrate text classification with the bag-of-words model. We train an SVM classifier to separate spam from ham (legitimate messages).
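Before moving to the full pipeline, it may help to see what a bag-of-words representation looks like on a tiny input. The following is a minimal sketch in base R, assuming three made-up messages (they are not taken from the spam dataset); the names `docs`, `vocab` and `bow` are illustrative only.

```r
# Hypothetical toy messages, not taken from the spam dataset
docs <- c("free prize call now",
          "call me now",
          "free free prize")

# Split each message into lowercase word tokens
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct words across all messages
vocab <- sort(unique(unlist(tokens)))

# For each message, count how often every vocabulary word occurs
counts <- vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# Rows are messages, columns are vocabulary words
bow <- t(counts)
dimnames(bow) <- list(paste0("doc", seq_along(docs)), vocab)

print(bow)
```

Each row is one message and each column one word, so word order is discarded and only the counts remain. This is the same structure that the document-term matrix holds for the real dataset.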

Step 1: Load all required libraries


library(data.table) # Fast data frames (data.table)
library(stringr)    # String helper functions
library(caret)      # Model evaluation utilities
library(tm)         # Text mining: corpus handling and document-term matrices
library(slam)       # Sparse matrix support used by tm
library(e1071)      # SVM classifier

Step 2: Load the dataset and preprocess it in the same way as in the previous example


data = read.csv("/kaggle/input/spam-email/spam.csv")
corpus = Corpus(VectorSource(data$Message))
# Encode the categories numerically: ham = 0, spam = 1
labels = as.numeric(factor(data$Category, levels = c("ham", "spam")))
labels = labels - 1
# Preprocess the text
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removeWords, stopwords("SMART"))
corpus = tm_map(corpus, stripWhitespace)
# Build the document-term matrix: one row per message, one column per term
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
# Convert the matrix to a data.table
dtm_df <- as.data.table(matrix)
# Attach the numeric labels (0 = ham, 1 = spam)
dtm_df$label <- labels
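A document-term matrix built from a real corpus usually has thousands of columns, most of them zero, and converting it to a dense matrix can be costly. As an optional refinement that is not part of the original walkthrough, the sketch below uses `removeSparseTerms()` from the `tm` package to drop rare terms first. The toy messages and the `sparse = 0.6` cutoff are assumptions chosen only for illustration.

```r
library(tm)

# Hypothetical toy messages, not taken from the spam dataset
messages <- c("free prize call now",
              "call me when you are free",
              "win a free prize today",
              "are you coming home today")

toy_corpus <- Corpus(VectorSource(messages))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Drop terms that are missing from more than 60% of the messages;
# with four messages this keeps terms that occur in at least two of them
small_dtm <- removeSparseTerms(toy_dtm, sparse = 0.6)

# Compare the number of columns before and after pruning
dim(toy_dtm)
dim(small_dtm)
Terms(small_dtm)
```

On the full dataset, a much higher cutoff (closer to 1) is typically more appropriate, since most words appear in only a small fraction of messages; the right value depends on the data.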


Step 3: Perform the train-test split in an 80:20 ratio for the training and test sets respectively.


# Fix the random seed so the split is reproducible (the seed value is arbitrary)
set.seed(123)
# Split the dataset into training and test sets in an 80:20 ratio
split_index <- sample(1:nrow(dtm_df), 0.8 * nrow(dtm_df))
train_set <- dtm_df[split_index]
test_set <- dtm_df[-split_index]
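Spam datasets are usually imbalanced, so a purely random split can leave the test set with a slightly different spam share than the training set. A stratified split, which samples 80% within each class separately, avoids this. The following is a minimal base R sketch on made-up labels; the data frame `toy` and its 80/20 class mix are assumptions for illustration, not the article's data.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical labels: 80 ham (0) and 20 spam (1)
toy <- data.frame(id = 1:100, label = rep(c(0, 1), times = c(80, 20)))

# Group row indices by class, then sample 80% of the indices in each group
idx_by_class <- split(seq_len(nrow(toy)), toy$label)
train_idx <- unlist(lapply(
  idx_by_class,
  function(idx) sample(idx, floor(0.8 * length(idx)))
))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Both subsets keep the same class proportions as the full data
prop.table(table(train_toy$label))
prop.table(table(test_toy$label))
```

The `caret` package loaded in Step 1 also provides `createDataPartition()` for the same purpose.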

Step 4: Train the model, generate predictions, and build the confusion matrix


# Train the model on the training set
model <- svm(label ~ ., data = train_set, kernel = "linear")
# Make predictions on the test set
predictions <- predict(model, newdata = test_set[, -"label"])
# Evaluate the model
# The label is numeric, so svm() fits a regression model and returns continuous scores
threshold <- 0.5
# Convert the continuous scores into binary class predictions
binary_predictions <- ifelse(predictions > threshold, 1, 0)
confusion_matrix <- table(binary_predictions, test_set$label)
# Print the confusion matrix
print(confusion_matrix)
# Compute the accuracy of the model
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))


binary_predictions   0   1
                 0 958  23
                 1   1 133
[1] "Accuracy: 0.97847533632287"

Hence, the model reaches an accuracy of approximately 97.85%: it classifies 1,091 of the 1,115 test messages correctly.
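Accuracy alone can be flattering on imbalanced data, so it is worth reading precision and recall for the spam class from the same confusion matrix. The sketch below re-enters the four counts from the output above by hand (rows are predictions, columns are true labels); the variable names are illustrative.

```r
# Confusion matrix copied from the printed output:
# rows are predictions, columns are the true labels (0 = ham, 1 = spam)
cm <- matrix(c(958, 1, 23, 133), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- cm["1", "1"]  # spam correctly flagged as spam
fp <- cm["1", "0"]  # ham wrongly flagged as spam
fn <- cm["0", "1"]  # spam that slipped through as ham

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Accuracy of always predicting ham, as a baseline for comparison
baseline <- sum(cm[, "0"]) / sum(cm)

round(c(precision = precision, recall = recall, f1 = f1, baseline = baseline), 3)
# precision 0.993, recall 0.853, f1 0.917, baseline 0.860
```

On these numbers the classifier almost never flags ham as spam, but it misses 23 of the 156 spam messages. A classifier that always predicts ham would already reach about 86% accuracy, which puts the 97.85% figure in context.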

Effectively representing textual data is crucial for training models in Machine Learning. The Bag-of-Words (BOW) model serves this purpose by transforming text into numerical form. This article comprehensively explores the Bag-of-Words model, elucidating its fundamental concepts and utility in text representation for Machine Learning.

