Generate bigrams with NLTK

Bigrams, or pairs of consecutive words, are an essential concept in natural language processing (NLP) and computational linguistics. Their utility spans various applications, from enhancing machine learning models to improving language understanding in AI systems. In this article, we will learn how to generate bigrams using the NLTK library.

Table of Contents

  • What are Bigrams?
  • How Are Bigrams Generated?
  • Generating Bigrams using NLTK
  • Applications of Bigrams
  • FAQs on Bigrams in NLP

What are Bigrams?

In a sequence of text, bigrams are pairs of consecutive words or tokens. Bigrams allow us to see which words commonly co-occur within a given dataset, which can be particularly useful for:

  • Predictive text and autocomplete features, where the next word is predicted based on the previous word.
  • Speech recognition systems to improve accuracy by considering two words at a time.
  • Information retrieval systems to enhance search accuracy.

How Are Bigrams Generated?

Let’s take the example sentence “You are learning from Beginner for Beginner”. Generating bigrams from it involves taking two adjacent words at a time to form pairs. Let’s break down the process and the purpose of each bigram:

Step 1: Tokenization

The first step is to split the sentence into individual words (tokens). For the sentence “You are learning from Beginner for Beginner”, the tokens would be:

 ['You', 'are', 'learning', 'from', 'Beginner', 'for', 'Beginner']

Step 2: Creating Bigrams

After tokenization, bigrams are formed by pairing each word with the next word in the sequence. Here’s how each bigram is constructed from the tokens:

  1. (‘You’, ‘are’): This bigram pairs the first word “You” with the second word “are”. It helps in understanding the use of the pronoun “You” in a command or statement form, indicating the subject of the sentence.
  2. (‘are’, ‘learning’): This bigram links “are” with “learning”, forming a verb phrase that indicates an ongoing action. It’s crucial for capturing the progressive tense in the sentence.
  3. (‘learning’, ‘from’): Connecting “learning” with “from” helps in identifying the prepositional phrase that specifies the source or method of learning.
  4. (‘from’, ‘Beginner’): This bigram pairs “from” with “Beginner”, which indicates the starting point or the source of learning, in this case, the entity “Beginner”.
  5. (‘Beginner’, ‘for’): Pairing “Beginner” with “for” sets up another phrase, hinting at a purpose or reason that is about to be explained.
  6. (‘for’, ‘Beginner’): This final bigram links “for” back to “Beginner”. It suggests a repetitive or cyclical learning process from the same source, or, depending on the wider context in a longer text, that the learning is intended for “Beginner”.

Each of these bigrams captures a small piece of the syntactic and semantic structure of the sentence. Analyzing these pairs helps in understanding how words combine to form meaningful phrases that contribute to the overall meaning of the sentence.
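
Before using any library, the same pairing can be reproduced in a few lines of plain Python. The sketch below is a minimal illustration: it assumes simple whitespace tokenization (a real tokenizer, such as the one used in the next section, also handles punctuation) and uses zip() to pair each token with its successor.

Python3
# Minimal bigram construction without any library.
# Whitespace splitting is a simplification; word_tokenize (used in the
# next section) handles punctuation properly.
text = "You are learning from Beginner for Beginner"
tokens = text.split()

# Pair each token with the token that follows it
bigram_list = list(zip(tokens, tokens[1:]))
print(bigram_list)

Running this prints the same six pairs derived by hand above.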

Generating Bigrams using NLTK

Generating bigrams using the Natural Language Toolkit (NLTK) in Python is a straightforward process. The steps to generate bigrams from text data using NLTK are discussed below:

  1. Import NLTK and Download Tokenizer: The code first imports the nltk library and downloads the punkt tokenizer, which is part of NLTK’s data used for tokenization.
  2. Tokenization: The word_tokenize() function from nltk.tokenize is used to tokenize the input text into a list of words (tokens). Tokenization is the process of splitting a text into individual words or tokens.
  3. Generating Bigrams: The bigrams function from nltk.util is then used to generate a list of bigrams from the tokenized words. Each bigram is a tuple containing two consecutive words from the text.
  4. Printing Bigrams: Finally, the code iterates over the list of bigrams (bigram_list) and prints each bigram.
Python3
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import bigrams

nltk.download('punkt')  # Download the 'punkt' tokenizer data

# Sample text
text = "You are learning from Beginner for Beginner"

# Tokenize the text
tokens = word_tokenize(text)

# Generate bigrams
bigram_list = list(bigrams(tokens))

# Print the bigrams
for bigram in bigram_list:
    print(bigram)

Output:

('You', 'are')
('are', 'learning')
('learning', 'from')
('from', 'Beginner')
('Beginner', 'for')
('for', 'Beginner')
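
The same idea extends beyond pairs: nltk.util also provides an ngrams() function that takes the window size as a second argument, so ngrams(tokens, 2) produces the same pairs as bigrams(tokens). The short sketch below uses it to generate trigrams from the same sentence.

Python3
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Assumes the 'punkt' tokenizer data downloaded in the example above
text = "You are learning from Beginner for Beginner"
tokens = word_tokenize(text)

# ngrams(tokens, 2) would reproduce the bigrams above; n=3 gives trigrams
for trigram in ngrams(tokens, 3):
    print(trigram)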

Applications of Bigrams

Common applications of bigrams in natural language processing (NLP) and text analysis include:

  • Language modeling: Language models are statistical models that estimate the likelihood of a sequence of words appearing together in a language. Bigram models estimate the probability of one word following another and are applied in machine translation, speech recognition, and other tasks (see the sketch after this list).
  • Text prediction: Text completion relies on bigrams to guess what comes next given what came before. By studying how often various bigrams occur in a dataset, you can predict the following word reasonably well; this powers autocomplete features in search engines and messaging apps.
  • Information retrieval: Indexing bigrams in information retrieval systems speeds up searching through documents, because bigrams represent more closely related pairs than single terms do, increasing both precision and recall when retrieving items from large collections such as those found online.
  • Text classification: Sentiment analysis, spam detection, and topic categorization can all benefit from treating bigrams as classification features. In sentiment analysis, for example, considering two-word combinations gives more context about the expressed opinion, helping classifiers decide more accurately whether something is positive or negative.
  • Named Entity Recognition (NER): In NER systems, bigrams help identify named entities such as person names, locations, and organizations in text. Capturing patterns of words that frequently appear together within such entities can improve NER model performance.
  • Spelling Correction: Bigrams may also be employed in spelling correction systems to propose corrections for misspelled words. By comparing a misspelled word’s bigrams with those of correctly spelled words in a dictionary, a system can suggest likely alternatives.
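
To make the language-modeling and text-prediction points above concrete, here is a small sketch that counts bigram frequencies with NLTK’s ConditionalFreqDist and looks up the most likely next word. The toy corpus is made up for illustration; a usable model would be trained on far more text.

Python3
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import bigrams

nltk.download('punkt')  # tokenizer data, as in the earlier example

# Toy corpus, made up for illustration only
corpus = "you are learning and you are practicing and you are improving"
tokens = word_tokenize(corpus)

# Map each word to a frequency distribution over the words that follow it
cfd = nltk.ConditionalFreqDist(bigrams(tokens))

# The most frequent successor of 'you' is the bigram model's prediction
print(cfd['you'].most_common(1))  # [('are', 3)]
print(cfd['are'].most_common())   # each continuation of 'are' seen once

Dividing these counts by the total count for the conditioning word turns them into the conditional probabilities that a bigram language model uses.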

FAQs on Bigrams in NLP

Why are bigrams important?

Bigrams help us understand some of the sequential structure of language and can reveal relationships between words within a given context. They are used in sentiment analysis, part-of-speech (POS) tagging, named entity recognition (NER), information retrieval (IR), and more.

How can I generate bigrams using NLTK?

To generate bigrams using the NLTK library, you need to follow two steps: first, tokenize your text into words using the word_tokenize() function; then call the bigrams() function on the resulting tokens.