Bag of Words and Frequency Counts in Text Using Scikit-learn

Text data is ubiquitous in today’s digital world, from emails and social media posts to research articles and customer reviews. To analyze and derive insights from this textual information, it’s essential to convert text into a numerical form that machine learning algorithms can process. One of the fundamental methods for this conversion is the “Bag of Words” (BoW) model, which represents text as a collection of word frequencies. In this article, we will explore the BoW model, its implementation, and how to perform frequency counts using Scikit-learn, a powerful machine-learning library in Python.

What is the Bag of Words Model?

The Bag of Words model is a simple and effective way of representing text data. It treats a text document as an unordered collection of words, disregarding grammar and word order while preserving the word frequency. The primary steps involved in creating a BoW model are:

  • Tokenization: Splitting the text into individual words (tokens).
  • Vocabulary Building: Creating a vocabulary of unique words from the entire corpus.
  • Vectorization: Transforming each document into a numerical vector based on the frequency of each word in the vocabulary.

Example: Consider a small corpus with the following two sentences:

"The cat sat on the mat."
"The dog sat on the log."

The vocabulary consists of the unique words, listed here in the order they first appear: [“the”, “cat”, “sat”, “on”, “mat”, “dog”, “log”]. Each sentence is then represented as a vector of counts over that vocabulary (note that Scikit-learn’s CountVectorizer, used later, orders its vocabulary alphabetically instead, which is why the output below lists the features in a different order):

"The cat sat on the mat.": [2, 1, 1, 1, 1, 0, 0]" 
The dog sat on the log.": [2, 0, 1, 1, 0, 1, 1]
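
Before turning to Scikit-learn, it may help to see the three steps written out by hand. Below is a minimal plain-Python sketch that reproduces the vocabulary and vectors above; its lowercase-and-strip-punctuation tokenizer is a simplification that works for this tiny corpus, not a general-purpose tokenizer.

Python
# Tokenization: lowercase each sentence, drop the period, split on whitespace
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log."
]
tokenized = [sentence.lower().replace(".", "").split() for sentence in corpus]

# Vocabulary building: collect unique words in order of first appearance
vocabulary = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocabulary:
            vocabulary.append(word)

# Vectorization: count each vocabulary word's occurrences in each document
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vectors)     # [[2, 1, 1, 1, 1, 0, 0], [2, 0, 1, 1, 0, 1, 1]]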

Implementing Bag of Words with Scikit-learn

Scikit-learn provides a straightforward implementation of the BoW model through its CountVectorizer class.

Here’s a step-by-step guide to implementing BoW and performing frequency counts using Scikit-learn.

Python
# Step 1: Importing the Required Libraries

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Step 2: Preparing the Corpus
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log."
]

# Step 3: Initializing and Fitting the CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
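# Note: fit_transform returns a SciPy sparse matrix of shape
# (n_documents, n_vocabulary_terms); toarray() below converts it to a dense array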

# Step 4: Displaying the Vocabulary and Frequency Counts
print("Vocabulary:", vectorizer.vocabulary_)
print("Feature Names:", vectorizer.get_feature_names_out())
print("Bag of Words Representation:\n", X.toarray())

# Step 5: Analyzing Word Frequencies
word_counts = np.sum(X.toarray(), axis=0)
# .tolist() converts NumPy integers to plain Python ints for a clean printout
word_freq = dict(zip(vectorizer.get_feature_names_out(), word_counts.tolist()))
print("Word Frequencies:", word_freq)

Output:

Vocabulary: {'the': 6, 'cat': 0, 'sat': 5, 'on': 4, 'mat': 3, 'dog': 1, 'log': 2}
Feature Names: ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
Bag of Words Representation:
[[1 0 0 1 1 1 2]
 [0 1 1 0 1 1 2]]
Word Frequencies: {'cat': 1, 'dog': 1, 'log': 1, 'mat': 1, 'on': 2, 'sat': 2, 'the': 4}
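
Once fitted, the same vectorizer can be reused to vectorize unseen text with its transform() method; words that were not in the learned vocabulary are simply ignored. Here is a quick sketch (the new sentence is an illustrative assumption, not part of the original corpus):

Python
# Vectorizing a new document with the already-fitted vectorizer
new_doc = ["The bird sat on the mat."]  # hypothetical sentence; "bird" is out of vocabulary
print(vectorizer.transform(new_doc).toarray())
# [[0 0 0 1 1 1 2]]  <- "bird" is dropped because it never appeared during fitting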

Conclusion

In this article, we’ve covered the basic steps to create a BoW model and perform frequency count analysis using Scikit-learn. This knowledge serves as a stepping stone to more advanced text processing techniques, such as TF-IDF, word embeddings, and neural network-based models, which build upon the concepts introduced here.