Bag-Of-Words Model In R

In the following example, we use a spam email dataset and a bag-of-words representation for classification. An SVM classifier is used to separate spam from ham (legitimate) messages.
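Before working through the full pipeline, here is a minimal, self-contained sketch of the bag-of-words idea itself: two made-up sentences are turned into a term-count matrix with one column per distinct word. The sentences are purely illustrative, and tm is loaded again in Step 1.

R

# Toy illustration of bag-of-words: each document becomes a row of word counts
library(tm)

toy_corpus <- Corpus(VectorSource(c("free prize now", "see you at lunch now")))
toy_dtm <- DocumentTermMatrix(toy_corpus)
inspect(toy_dtm)  # one column per term, with its count in each sentence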

Step 1: Load all required libraries

R




library(data.table) # Data table / data frame handling
library(stringr)    # String methods
library(caret)      # Confusion matrix and evaluation helpers
library(tm)         # Text mining (corpus, document-term matrix)
library(slam)       # Sparse matrix support used by tm
library(e1071)      # SVM classifier
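If any of these packages are not yet installed, they can be installed once beforehand (the package names are the same as those loaded above):

R

# One-time setup: install any missing packages
install.packages(c("data.table", "stringr", "caret", "tm", "slam", "e1071"))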


Step 2: Load the dataset and preprocess it in the same way as in the previous example.

R




data <- read.csv("/kaggle/input/spam-email/spam.csv")

# build a corpus from the message text and encode the labels numerically
corpus <- Corpus(VectorSource(data$Message))
labels <- as.numeric(factor(data$Category, levels = c("ham", "spam")))
labels <- labels - 1  # 0 = ham, 1 = spam

# preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, stripWhitespace)

# build the document-term matrix and convert it to a data table
dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)
dtm_df <- as.data.table(matrix)

# attach the numeric labels (0 = ham, 1 = spam)
dtm_df$label <- labels
head(dtm_df)


Output:
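The document-term matrix produced above has one column per unique token, so it can be very wide. As an optional extra step, very rare terms can be dropped before converting it to a data table; a small sketch using tm's removeSparseTerms on the same dtm object:

R

# Optional: drop terms with more than 99% sparsity, i.e. keep only terms
# that appear in at least roughly 1% of the messages, to shrink the matrix
dtm_small <- removeSparseTerms(dtm, 0.99)
dim(dtm_small)  # far fewer columns than the full dtm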

Step 3: Split the data into training and test sets in an 80%/20% ratio.

R




# set the random seed for reproducibility
set.seed(42)

# split the dataset into train and test sets in an 80/20 ratio
split_index <- sample(1:nrow(dtm_df), 0.8 * nrow(dtm_df))
train_set <- dtm_df[split_index]
test_set <- dtm_df[-split_index]
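Because the split is random, it can be worth confirming that ham and spam occur in similar proportions in both splits; a quick check with base R on the train_set and test_set created above:

R

# proportion of ham (0) and spam (1) in each split
prop.table(table(train_set$label))
prop.table(table(test_set$label))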


Step 4: Train the SVM model, generate predictions on the test set and build the confusion matrix.

R




# train the SVM model (linear kernel) on the training set
model <- svm(label ~ ., data = train_set, kernel = "linear")

# make predictions on the test set (drop the label column)
predictions <- predict(model, newdata = test_set[, -"label"])

# evaluate the model: the label is numeric, so svm() fits a regression and
# predict() returns continuous values; threshold them at 0.5 to get classes
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)

# build and print the confusion matrix
confusion_matrix <- table(binary_predictions, test_set$label)
print(confusion_matrix)

# compute the accuracy of the model
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))


Output:

binary_predictions   0   1
                 0 958  23
                 1   1 133
[1] "Accuracy: 0.97847533632287"

Hence, the model achieves a high accuracy of approximately 97.85%, indicating that it correctly predicts the class for the large majority of test messages.
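Since caret was already loaded in Step 1, its confusionMatrix() function can be used as a follow-up to report additional metrics such as sensitivity and specificity; a short sketch using the binary_predictions and test_set objects from Step 4, treating spam (1) as the positive class:

R

# caret's confusion matrix with per-class statistics
caret::confusionMatrix(
  data = factor(binary_predictions, levels = c(0, 1)),
  reference = factor(test_set$label, levels = c(0, 1)),
  positive = "1"
)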

