Bag-Of-Words Model In R
In the following example, we use a spam email dataset and a bag-of-words representation for classification. We train an SVM classifier to distinguish spam from ham (legitimate) messages.
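Before walking through the pipeline, a minimal sketch in base R (using two made-up documents) illustrates what a bag-of-words representation is: each document becomes a vector of word counts over a shared vocabulary, and word order is discarded.

```r
# Two hypothetical documents for illustration
docs <- c("free prize call now", "call me at home")

# Tokenize each document by splitting on spaces
tokens <- strsplit(docs, " ")

# The vocabulary is every unique word across the corpus
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
bow <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(bow) <- c("doc1", "doc2")
bow
```

Each row of `bow` is the count vector for one document; "call" appears in both rows, while "free" appears only in the first.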
Step 1: Load all required libraries
R
library(data.table) # Dataframe library
library(stringr)    # For string methods
library(caret)      # For confusion matrix
library(tm)         # For text mining
library(slam)       # For preprocessing
library(e1071)      # For SVM classifier
Step 2: Load the dataset and preprocess it, similarly to the previous example
R
data <- read.csv("/kaggle/input/spam-email/spam.csv")
corpus <- Corpus(VectorSource(data$Message))

# Numerically encode the labels used for ham and spam: ham -> 0, spam -> 1
labels <- as.numeric(factor(data$Category, levels = c("ham", "spam")))
labels <- labels - 1

# Preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
matrix <- as.matrix(dtm)

# Convert the matrix to a data.table and attach the encoded labels
dtm_df <- as.data.table(matrix)
dtm_df$label <- labels
head(dtm_df)
Output:
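The label-encoding step above can be checked in isolation. This small sketch, using made-up categories, shows how the factor-to-numeric conversion maps "ham" to 0 and "spam" to 1:

```r
# Hypothetical categories to illustrate the encoding used above
category <- c("ham", "spam", "ham", "spam")

# factor() with levels c("ham", "spam") numbers them 1 and 2;
# subtracting 1 yields the binary labels 0 and 1
labels <- as.numeric(factor(category, levels = c("ham", "spam"))) - 1
labels
```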
Step 3: Perform the train-test split in the ratio of 80% and 20% for the train and test sets respectively.
R
# Set the random seed for reproducibility
set.seed(42)

# Split the dataset into train and test sets in an 80-20 ratio
split_index <- sample(1:nrow(dtm_df), 0.8 * nrow(dtm_df))
train_set <- dtm_df[split_index]
test_set <- dtm_df[-split_index]
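The index-based split can be seen more clearly on a toy example: sampling 80% of the row indices selects the training rows, and negative indexing takes the remainder as the test rows.

```r
# Toy illustration of the same split on 10 hypothetical rows
set.seed(42)
n <- 10
split_index <- sample(1:n, 0.8 * n)

# Positive indexing keeps the sampled rows, negative indexing drops them
train_rows <- (1:n)[split_index]
test_rows  <- (1:n)[-split_index]
length(train_rows)
length(test_rows)
```

The two index sets are disjoint and together cover every row exactly once.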
Step 4: Train the model, create predictions on the test set, and build the confusion matrix. Because the label column is numeric, svm() fits a regression model, so its continuous predictions must be thresholded to obtain binary classes.
R
# Train the model using the training dataset
model <- svm(label ~ ., data = train_set, kernel = "linear")

# Make predictions on the test set
predictions <- predict(model, newdata = test_set[, -"label"])

# Evaluate the model: threshold the continuous SVM predictions at 0.5
# to convert them into binary class labels
threshold <- 0.5
binary_predictions <- ifelse(predictions > threshold, 1, 0)

# Build and print the confusion matrix
confusion_matrix <- table(binary_predictions, test_set$label)
print(confusion_matrix)

# Compute the accuracy of the model
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
Output:
binary_predictions 0 1
0 958 23
1 1 133
[1] "Accuracy: 0.97847533632287"
Hence, the model has a high accuracy of approximately 97.85%, indicating that it correctly predicts the class for a large proportion of instances.
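The printed confusion matrix also allows a quick cross-check of the accuracy, along with precision and recall for the spam class, using only base R:

```r
# Rebuild the confusion matrix printed above (rows: predicted, cols: actual)
cm <- matrix(c(958, 1, 23, 133), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Accuracy: correct predictions over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Precision: true spam among all messages predicted as spam
precision <- cm["1", "1"] / sum(cm["1", ])

# Recall: true spam among all messages that are actually spam
recall <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

The recall (about 0.85) shows that while overall accuracy is high, some spam messages are still missed, which accuracy alone does not reveal.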