Text Classification using Bag of Words

We will use a CSV file of poems from poetryfoundation.org, available as a dataset on kaggle.com.

Step 1: Install the libraries

install.packages("data.table")
install.packages("stringr")
install.packages("tm")
install.packages("caret")

Step 2: Import the data

R

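library(data.table)
library(stringr)
library(tm)    # text mining: corpus handling and cleaning
library(slam)  # sparse matrix utilities that tm builds on

# read the CSV and keep only the column that holds the poem text
data = read.csv("/kaggle/input/modern-renaissance-poetry/all.csv")$content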

Preprocessing of Data

Text needs to be preprocessed before it is passed to a model. Consider these two example texts:

Hello ,,, how are <b>you</b>
I am fine, what about you?

The first sentence contains a lot of unnecessary characters, which can make the model inaccurate. The second sentence is much cleaner, but its comma and question mark are still not needed. Punctuation generally does not add much information, and neither does letter case.

Stopwords are the second concern. Words such as “and”, “in”, “on” and “the” don’t add much information and can skew the model, so we need to remove them.
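As a rough illustration of both points, the two example texts can be cleaned with base R alone. This is only a sketch: clean_text, remove_stopwords and the tiny stop_words list are hypothetical helpers made up for this example and are not part of tm.

```r
texts <- c("Hello ,,, how are <b>you</b>", "I am fine, what about you?")

# hypothetical helper: strip tags, case and punctuation from a character vector
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)          # replace HTML tags such as <b> with a space
  x <- tolower(x)                       # ignore letter case
  x <- gsub("[[:punct:]]+", " ", x)     # replace runs of punctuation with a space
  x <- gsub("[[:space:]]+", " ", x)     # collapse repeated whitespace
  trimws(x)                             # drop leading and trailing spaces
}

# hypothetical helper: drop every token that appears in the stopword list
remove_stopwords <- function(x, stops) {
  vapply(
    strsplit(x, " ", fixed = TRUE),
    function(tokens) paste(tokens[!tokens %in% stops], collapse = " "),
    character(1)
  )
}

# a made-up stopword list, far smaller than a real one
stop_words <- c("how", "are", "i", "am", "what", "about")

cleaned  <- clean_text(texts)
no_stops <- remove_stopwords(cleaned, stop_words)

print(cleaned)
print(no_stops)
```

Here cleaned holds "hello how are you" and "i am fine what about you", and no_stops holds "hello you" and "fine you". Note that the HTML tags are removed before the punctuation: stripping only the angle brackets would leave the tag letters glued to the word.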

Step 3: Preprocess the data. This includes:

Convert to lower case
Remove punctuation
Remove numbers
Remove stopwords
Strip whitespace
Finally, create a document-term matrix (documents by words)

We will use the tm package for text mining and slam, which provides the sparse matrices that tm builds on.

R

corpus = Corpus(VectorSource(data)) 
  
corpus = tm_map(corpus, content_transformer(tolower)) 
corpus = tm_map(corpus, removePunctuation) 
corpus = tm_map(corpus, removeNumbers) 
  
# removing stop words such as the, a, etc. 
corpus = tm_map(corpus, removeWords, stopwords("SMART")) 
# removing white space 
corpus = tm_map(corpus, stripWhitespace)

Document-Term Matrix

A document-term matrix is a frequency table with one row per document and one column per word in the dictionary (the vocabulary). We create it by passing the corpus to DocumentTermMatrix and then converting the resulting object into an ordinary matrix.
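To make the shape of such a table concrete, here is a minimal base R sketch that builds one by hand for three made-up one-line documents. The names docs, vocab and toy_dtm are invented for this example; tm applies its own tokenizer and defaults, so its output may differ in detail.

```r
docs <- c("roses are red", "violets are blue", "roses and violets")

# split each document into words on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# the dictionary: every distinct word, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# count each dictionary word per document; vapply returns one column per
# document, so transpose to get documents as rows and words as columns
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(toy_dtm)
```

Each cell holds how often the column's word occurs in the row's document, so most cells are zero once the vocabulary grows.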

R

matrix = as.matrix(DocumentTermMatrix(corpus)) 
print(matrix)
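
For a large corpus, as.matrix can produce a very big dense matrix in which most cells are zero. If only per-word totals are needed, one option is to sum the columns while the data is still sparse, using col_sums from the slam package loaded earlier. The sketch below uses a small made-up matrix (toy_dense) rather than the poem data. Applying col_sums directly to the result of DocumentTermMatrix rests on the assumption that tm stores that object as a slam sparse matrix.

```r
library(slam)

# a made-up 2 x 3 count matrix standing in for a document-term matrix
toy_dense <- matrix(
  c(1, 0, 2,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("love", "night", "sea"))
)

# store only the non-zero cells
toy_sparse <- as.simple_triplet_matrix(toy_dense)

# column totals computed on the sparse form and on the dense form
sparse_totals <- col_sums(toy_sparse)
dense_totals  <- colSums(toy_dense)

print(sparse_totals)
```

Both routes give the same totals; the sparse one just avoids allocating the full table.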

Step 4: Sum each word's column to get its total frequency across all documents, then look at the ten most frequent words

R

word_frequencies = colSums(matrix) 
  
N = 10 
top_words = names(sort(word_frequencies, decreasing = TRUE)[1:N]) 
print(word_frequencies[top_words])
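
One thing to watch in the snippet above: indexing with [1:N] pads the result with NA when the vocabulary has fewer than N words. A hedged alternative is head(), which simply returns what is available. The counts below are invented for illustration.

```r
# made-up word totals with fewer than N entries
toy_counts <- c(love = 4, night = 7, sea = 2)
N <- 10

# indexing past the end pads with NA
padded <- sort(toy_counts, decreasing = TRUE)[1:N]

# head() stops at the end of the vector
top <- head(sort(toy_counts, decreasing = TRUE), N)

print(anyNA(padded))
print(top)
```

With the poem corpus the vocabulary is far larger than ten words, so both forms behave the same there; the difference only matters for small inputs.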

Step 5: We can also plot the word frequencies using barplot

R

top_frequencies = word_frequencies[top_words] 
barplot(top_frequencies, main = "Top 10 Word Frequencies", xlab = "Words", ylab = "Frequency", col = "darkgreen")

Bag-Of-Words Model In R

Effectively representing textual data is crucial for training models in Machine Learning. The Bag-of-Words (BOW) model serves this purpose by transforming text into numerical form. This article comprehensively explores the Bag-of-Words model, elucidating its fundamental concepts and utility in text representation for Machine Learning.
