BERT Embeddings

Another important pre-trained transformer-based model, developed by Google, is BERT (Bidirectional Encoder Representations from Transformers). It can be used to extract high-quality language features from raw text, or it can be fine-tuned on your own data to perform specific tasks.

BERT’s architecture consists only of encoders, and its input is a sequence of tokens represented as the sum of three embeddings: token embeddings, segment embeddings, and positional embeddings. The model is pre-trained by masking a few words in a sentence and tasking it with predicting the masked words.
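To make the masked-word objective concrete, the short sketch below uses the Hugging Face fill-mask pipeline with bert-base-uncased to predict a masked token. This is an illustrative addition (the example sentence is arbitrary), and it assumes the transformers library from the install step below is already available.

Python

from transformers import pipeline

# Minimal sketch of BERT's masked language modelling objective:
# predict the most likely fillers for the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))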


First, install the transformers library, since we’ll be using PyTorch and transformers for this implementation.

!pip install transformers

Python

import torch
from transformers import BertTokenizer, BertModel
 
# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "This blog post explains pre trained word embeddings"

# Add the special classification and separator tokens expected by BERT
marked_text = "[CLS] " + text + " [SEP]"
 
# Tokenize the sentence with the BERT (WordPiece) tokenizer
tokenized_text = tokenizer.tokenize(marked_text)
 
# Map each token to its index in the BERT vocabulary
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
 
# Print out the tokens
print(tokenized_text)

                    

Output:

['[CLS]', 'this', 'blog', 'post', 'explains', 'pre', 'trained', 'word', 'em', '##bed', '##ding', '##s', '[SEP]']
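Notice that the WordPiece tokenizer splits rarer words into subword pieces (here, "embeddings" becomes 'em', '##bed', '##ding', '##s'). As an aside, recent versions of transformers can also add the special tokens and build the tensors in a single call; the following is a small sketch of that alternative, reusing the same tokenizer object (an optional shortcut, not required for the steps below).

Python

# Sketch of the one-call tokenizer API (assumes a recent transformers version):
# it inserts [CLS]/[SEP] automatically and returns PyTorch tensors directly.
encoded = tokenizer("This blog post explains pre trained word embeddings",
                    return_tensors="pt")
print(encoded["input_ids"])       # token ids, including [CLS] and [SEP]
print(encoded["token_type_ids"])  # segment ids (all 0 for a single sentence)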

Python

# Mark every token as belonging to a single sentence (segment ID 1)
segments_ids = [1] * len(tokenized_text)
 
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
 
# Initialise the pre-trained model and ask it to return all hidden states
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)
 
# Put the model in evaluation mode and disable gradient computation
model.eval()
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # hidden_states holds 13 tensors: the embedding layer plus the 12 encoder layers
    hidden_states = outputs[2]
 
# Concatenate the last four hidden layers to get a richer embedding for each token
word_embedding = torch.cat([hidden_states[i] for i in [-1, -2, -3, -4]], dim=-1)
print(word_embedding)

                    

Output:

tensor([[[ 0.1508, -0.0126, -0.0503,  ...,  0.0346,  0.4191,  0.2692],
         [-0.2833, -0.4473, -0.1290,  ...,  0.1606,  0.5159,  0.2478],
         [ 0.6687, -0.4654,  0.3076,  ...,  0.2321, -0.0784, -0.7501],
         ...,
         [-0.1049,  0.6510, -0.3414,  ...,  0.0136,  0.3559,  0.0941],
         [-0.2663, -0.0465, -0.2842,  ..., -0.4947,  0.0606,  0.1420],
         [ 0.8174,  0.2086, -0.4486,  ..., -0.0698, -0.0547, -0.0229]]])
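The resulting tensor holds one 3072-dimensional vector per token (four concatenated 768-dimensional layers). If you need a single fixed-size vector for the whole sentence, one common pooling strategy, shown below as an assumption rather than part of the walkthrough above, is to average the token vectors of the second-to-last hidden layer.

Python

# Sketch of a common pooling strategy (an assumption, not part of the original code):
# average the token vectors of the second-to-last hidden layer into one sentence vector.
token_vecs = hidden_states[-2][0]                  # shape: [num_tokens, 768]
sentence_embedding = torch.mean(token_vecs, dim=0)
print(sentence_embedding.shape)                    # torch.Size([768])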

Conclusion

Generating word embeddings is a crucial technique for solving natural language problems, and pre-trained embeddings offer a powerful solution to the complexities of building word embeddings from scratch.


