Text Classification using Bag of Words

We will use a CSV file of poems from poetryfoundation.org, available as a dataset on kaggle.com.

Step 1: Install the libraries

install.packages("data.table")
install.packages("stringr")
install.packages("tm")
install.packages("caret")

Step 2: Import the data

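R

library(data.table)
library(stringr)
library(tm)    # text mining: corpus creation and cleaning
library(slam)  # sparse matrices, which tm uses internally

# Read the CSV and keep only the poem text column as a character vector
data = read.csv("/kaggle/input/modern-renaissance-poetry/all.csv",
                stringsAsFactors = FALSE)$content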


Preprocessing of Data

Text needs to be preprocessed before it can be used for modelling. Consider these two example sentences:

  1. Hello ,,, how are <b>you</b>
  2. I am fine, what about you?

The first sentence contains many unnecessary characters (repeated commas and HTML tags), which can make a model less accurate. The second sentence is much cleaner, but even its comma and question mark are not needed. Punctuation generally does not add much information, and the same is true of letter case.

The second issue is stopwords. Words such as “and”, “in”, “on”, and “the” do not add much information and can skew the model, so we remove them.
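As a rough illustration of these ideas, the sketch below cleans the two sample sentences using base R only. The `clean_text` helper and the tiny `stop_words` list are hypothetical names made up for this example; they are not part of tm, and real stopword lists are much longer.

```r
# The two sample sentences from the text above
texts <- c("Hello ,,, how are <b>you</b>", "I am fine, what about you?")

# Tiny illustrative stopword list (an assumption for this sketch)
stop_words <- c("i", "am", "how", "are", "what", "about")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                             # normalise case
  x <- gsub("<[^>]+>", " ", x)                # drop HTML tags such as <b>
  x <- gsub("[[:punct:]]+", " ", x)           # drop punctuation
  words <- strsplit(trimws(x), "\\s+")[[1]]   # split on runs of whitespace
  words <- words[!words %in% stop_words]      # drop stopwords
  paste(words, collapse = " ")                # rejoin with single spaces
}

cleaned <- vapply(texts, clean_text, character(1),
                  stop_words = stop_words, USE.NAMES = FALSE)
print(cleaned)
```

After cleaning, both sentences are reduced to the few words that carry content, which is the form a bag-of-words model works with.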

Step 3: Preprocess the data. This includes:

  • Converting to lower case
  • Removing punctuation
  • Removing numbers
  • Removing stopwords
  • Stripping extra whitespace
  • Finally, creating a document-term matrix of word counts

We will use the tm package for text mining and the slam package for sparse matrix support.

R

corpus = Corpus(VectorSource(data))
  
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
  
# removing stop words such as the, a, etc.
corpus = tm_map(corpus, removeWords, stopwords("SMART"))
# removing white space
corpus = tm_map(corpus, stripWhitespace)


Document-Term Matrix

A document-term matrix is a frequency table with one row per document and one column per word in the dictionary. We create it by passing the cleaned corpus to the DocumentTermMatrix function and then converting the result into an ordinary matrix.
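To make the structure concrete, here is a minimal base-R sketch that builds such a table by hand for three tiny, made-up documents. It only illustrates the idea; it is not how tm builds its matrix internally, and the names `docs`, `tokens`, `vocab`, and `dtm` are chosen for this example.

```r
# Three tiny, already-cleaned documents (illustrative data)
docs <- c("rose red rose", "violet blue", "red violet red rose")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# The dictionary: every distinct word, sorted
vocab <- sort(unique(unlist(tokens)))

# Count how often each dictionary word occurs in each document
dtm <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocab)))
}, integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```

Each row is one document and each column is one word, so a cell holds the number of times that word appears in that document. Word order within a document is lost, which is why the representation is called a "bag" of words.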

R

matrix = as.matrix(DocumentTermMatrix(corpus))
print(matrix)



Step 4: Sum each word's column to get its total frequency, then look at the ten most frequent words

R

word_frequencies = colSums(matrix)
  
N = 10
top_words = names(sort(word_frequencies, decreasing = TRUE)[1:N])
print(word_frequencies[top_words])
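
One detail worth noting: indexing with `[1:N]` assumes the vocabulary has at least N words. The poem dataset easily satisfies this, but on a very small corpus the result would be padded with NA values. A possible alternative, sketched here on made-up frequencies, is `head()`, which returns at most N entries:

```r
# Illustrative frequencies for a vocabulary of only three words
word_frequencies <- c(rose = 3, red = 3, violet = 2)

N <- 10

# Indexing with 1:N pads the result with NA when fewer than N words exist
padded <- sort(word_frequencies, decreasing = TRUE)[1:N]

# head() returns at most N entries and never pads
top <- head(sort(word_frequencies, decreasing = TRUE), N)

print(sum(is.na(padded)))  # number of NA entries added by the padding
print(top)
```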



Step 5: We can also plot the word frequencies using barplot

R

top_frequencies = word_frequencies[top_words]
barplot(top_frequencies, main = "Top 10 Word Frequencies", xlab = "Words", ylab = "Frequency", col = "darkgreen")



