Text Classification using Bag of Words
We will use a CSV file of poems from poetryfoundation.org, available as a dataset on kaggle.com.
Step 1: Install the libraries
install.packages("data.table")
install.packages("stringr")
install.packages("tm")
install.packages("caret")
Step 2: Import the data
R
library(data.table)
library(stringr)
library(tm)
library(slam)

data = read.csv("/kaggle/input/modern-renaissance-poetry/all.csv")$content
Preprocessing of Data
Text needs to be preprocessed before it can be used. Consider two example texts:
- Hello ,,, how are <b>you</b>
- I am fine, what about you?
The first sentence contains a lot of unnecessary characters, which can make the model inaccurate. The second sentence is much cleaner, but its comma and question mark are still not required. Punctuation generally doesn’t add much information, and neither does letter case.
The second issue is stopwords. Words such as “and”, “in”, “on”, and “the” don’t add much information and can skew the model, so we need to remove them.
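To see what this cleanup does to the two example texts, here is a minimal base R sketch. The helper `clean_text` and the tiny `stop_words` vector are hypothetical names made up for illustration; they are not part of tm, and the stopword list is far smaller than a real one.

```r
# The two example texts from above.
texts <- c("Hello ,,, how are <b>you</b>", "I am fine, what about you?")

# A toy stopword list (an assumption for this sketch, not tm's "SMART" list).
stop_words <- c("how", "are", "i", "am", "what", "about", "you")

# Hypothetical helper: strip tags, lower-case, drop punctuation and stopwords.
clean_text <- function(x, stop_words) {
  x <- gsub("<[^>]+>", " ", x)                # drop HTML tags such as <b>
  x <- tolower(x)                             # normalise letter case
  x <- gsub("[[:punct:]]+", " ", x)           # replace punctuation with spaces
  words <- strsplit(trimws(x), "\\s+")[[1]]   # split on runs of whitespace
  words <- words[!words %in% stop_words]      # remove stopwords
  paste(words, collapse = " ")
}

cleaned <- vapply(texts, clean_text, character(1),
                  stop_words = stop_words, USE.NAMES = FALSE)
print(cleaned)
```

With this toy list, only "hello" and "fine" survive, which shows how little of each sentence carries content once the noise is removed.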
Step 3: Preprocess the data. This includes:
- Convert to lower case
- Remove punctuation
- Remove numbers
- Remove stopwords
- Strip whitespace
- Finally, create a document-term matrix of word counts
We will use the tm package for text mining and the slam package for sparse matrix operations.
R
corpus = Corpus(VectorSource(data))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
# removing stop words such as the, a, etc.
corpus = tm_map(corpus, removeWords, stopwords("SMART"))
# removing white space
corpus = tm_map(corpus, stripWhitespace)
Document-Term Matrix
A document-term matrix is a frequency table with the documents on one axis and the dictionary of terms on the other; each cell holds the number of times a term occurs in a document. We create it by passing the corpus to the DocumentTermMatrix function and then converting the resulting object into an ordinary matrix.
R
matrix = as.matrix(DocumentTermMatrix(corpus))
print(matrix)
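To make the shape of a document-term matrix concrete, the following base R sketch builds one by hand for two made-up sentences. The names `docs`, `vocab`, and `dtm` are illustrative assumptions; tm produces the same kind of table, though it stores it in a sparse format.

```r
# Two toy documents (invented for this sketch, not taken from the poem data).
docs <- c("the cat sat", "the cat ate the fish")

# Tokenise on single spaces and collect the sorted vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# vapply returns terms x documents; transpose to documents x terms.
dtm <- t(counts)
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)
print(dtm)
```

Each row is one document and each column is one word, so the `doc2` row holds a 2 in the `the` column because that word appears twice in the second sentence.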
Step 4: Sum the column for each word to get its total frequency, then check the top ten words
R
word_frequencies = colSums(matrix)
N = 10
top_words = names(sort(word_frequencies, decreasing = TRUE)[1:N])
print(word_frequencies[top_words])
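Converting a document-term matrix to a dense matrix can use a lot of memory on a large corpus, because most cells are zero. As a hedged alternative, the sketch below sums the columns directly on the sparse object with `col_sums` from slam. The three toy documents and the names `toy_docs`, `toy_dtm`, and `toy_freq` are assumptions for illustration, and the example relies on the default settings of `DocumentTermMatrix`.

```r
library(tm)
library(slam)

# Three toy documents (invented; lower-case, words of three or more letters).
toy_docs <- c("sea and night", "night over the sea", "love the sea")

toy_corpus <- VCorpus(VectorSource(toy_docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# col_sums works on the sparse matrix, so no dense copy is created.
toy_freq <- col_sums(toy_dtm)

# Most frequent terms first.
print(sort(toy_freq, decreasing = TRUE))
```

On the poem corpus the same idea would apply to the object returned by `DocumentTermMatrix(corpus)` before any call to `as.matrix`, though whether it is needed depends on the corpus size and available memory.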
Step 5: We can also plot the word frequencies using barplot
R
top_frequencies = word_frequencies[top_words]
barplot(top_frequencies,
        main = "Top 10 Word Frequencies",
        xlab = "Words",
        ylab = "Frequency",
        col = "darkgreen")
Bag-Of-Words Model In R
Effectively representing textual data is crucial for training models in Machine Learning. The Bag-of-Words (BOW) model serves this purpose by transforming text into numerical form. This article comprehensively explores the Bag-of-Words model, elucidating its fundamental concepts and utility in text representation for Machine Learning.
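As a small illustration of what "transforming text into numerical form" means, the sketch below maps a sentence to a vector of word counts over a fixed vocabulary. The function `bow` and the three-word `vocab` are hypothetical names chosen for this example. It also shows a known property of the model: word order is discarded, so two sentences with the same words get the same vector.

```r
# Hypothetical helper: count occurrences of each vocabulary word in a text.
bow <- function(text, vocab) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  counts <- tabulate(match(words, vocab), nbins = length(vocab))
  names(counts) <- vocab
  counts
}

# A toy vocabulary (an assumption for this sketch).
vocab <- c("bites", "dog", "man")

a <- bow("dog bites man", vocab)
b <- bow("man bites dog", vocab)

print(a)
print(identical(a, b))
```

Both sentences produce one count for each of the three words, so the comparison is `TRUE` even though the sentences mean different things.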