Embedding

Embeddings are real-valued dense vectors (multi-dimensional arrays) that carry the meaning of words. They can capture the context of a word/sentence in a document, semantic similarity, relationships with other words/sentences, and so on. A popular example of how they capture meaning: if you subtract the vector for “man” from the vector for “king” and add the vector for “woman”, the result is a vector close to that of “queen”. Similar words also sit close to each other in the embedding space. Many pre-trained models are available, such as Word2Vec, GloVe, BERT, etc.

Pytorch Embedding

As defined in the official Pytorch documentation, an Embedding layer is “a simple lookup table that stores embeddings of a fixed dictionary and size.” So, at a low level, the Embedding layer is just a lookup table that maps an index to a row of a weight matrix of some dimension. This weight matrix is optimised during training (updated through backpropagation to reduce the loss) to produce more useful vectors.
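To make the lookup-table idea concrete, here is a minimal sketch (the vocabulary size of 10 and embedding dimension of 4 below are arbitrary values chosen only for illustration):

```python
import torch
import torch.nn as nn

# The layer is backed by a weight matrix of shape (num_embeddings, embedding_dim)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
print(embedding.weight.shape)  # torch.Size([10, 4])

# Passing an index simply selects the corresponding row of that matrix
idx = torch.tensor([3])
print(torch.equal(embedding(idx)[0], embedding.weight[3]))  # True
```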

Working

Now let’s look at how embedding works in Pytorch. When an embedding layer is created, an embedding matrix is initialised with random vectors of shape (num_embeddings, embedding_dim). This is essentially our lookup table, where words (via their indices) are mapped to rows of the matrix.

Given an input word or token, represented by its index in the vocabulary, you pass this index to the embedding layer, which looks up the corresponding row in the embedding matrix. That row is returned as the output embedding vector, which has dimension embedding_dim.
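A small sketch of this lookup using a toy vocabulary (the word-to-index mapping below is invented purely for illustration):

```python
import torch
import torch.nn as nn

# Toy vocabulary: each word is mapped to an integer index
word_to_idx = {"i": 0, "eat": 1, "pizza": 2, "every": 3, "day": 4}

embedding = nn.Embedding(num_embeddings=len(word_to_idx), embedding_dim=8)

# Looking up a single word returns one vector of size embedding_dim
pizza_idx = torch.tensor([word_to_idx["pizza"]])
print(embedding(pizza_idx).shape)   # torch.Size([1, 8])

# A whole sentence can be looked up in one call
sentence = torch.tensor([word_to_idx[w] for w in ["i", "eat", "pizza"]])
print(embedding(sentence).shape)    # torch.Size([3, 8])
```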

During training, the embedding vectors are updated through backpropagation to minimize the loss. This means the vectors are adjusted to better represent the semantics and relationships between words.
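A rough sketch of how backpropagation reaches the embedding matrix (the loss below is a dummy, used only to show that the looked-up rows receive gradients and get updated):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

indices = torch.tensor([1, 3])
output = embedding(indices)

# Dummy loss, used here only to demonstrate the update; a real model
# would compute a task loss (e.g. cross-entropy) from the embeddings
loss = output.sum()
loss.backward()

print(embedding.weight.grad[1])  # non-zero gradient only for the looked-up rows
optimizer.step()                 # rows 1 and 3 of the matrix get adjusted
```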

The Embedding layer takes a minimum of two arguments: num_embeddings and embedding_dim. There are various other optional parameters as well, such as padding_idx, max_norm, etc.; refer to the official docs for these. The first required parameter, num_embeddings, is the size of the dictionary. For example, if you have a vocabulary of 5000 words, the value passed as the first parameter will be 5000. The second required parameter, embedding_dim, is the size of each embedding vector (all the learned vectors have this fixed size).
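For example, a 5000-word vocabulary with 300-dimensional vectors would be declared as shown below (padding_idx is included only to illustrate one of the optional parameters):

```python
import torch.nn as nn

# 5000 rows (one per word in the vocabulary), each of size 300
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=300)

# One of the optional parameters: reserve index 0 for padding, so its
# vector is initialised to zeros and is not updated during training
padded_embedding = nn.Embedding(5000, 300, padding_idx=0)
```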

CBOW and Skip-gram Techniques

There are two major techniques for learning embeddings, known as Continuous Bag of Words (CBOW) and Skip-gram. Let’s learn a little about them below.

Continuous Bag of Words (CBOW) – CBOW predicts a target word based on the surrounding context words. This means that, to predict the focus word (the word we are interested in), it looks at the words surrounding it. The contextual representation of the surrounding words helps in predicting the focus word. It takes a pre-defined, fixed window size into account and tries to predict the target word.

Example – Suppose we have the sentence “I eat pizza every day”. With a context window of 1 (one word on each side of the focus word), the input for the focus word “eat” will be [“I”, “pizza”] and the target will be “eat”.

[Figure: CBOW – the context words W(t-2), W(t-1), W(t+1), W(t+2) are used to predict the focus word W(t)]

In the above diagram, W(t-2) and W(t-1) represent the words before the focus word, i.e., [‘I’, ‘eat’], and W(t+1), W(t+2) represent the words after the focus word, i.e., [‘every’, ‘day’]. These four words are used to predict the focus word, “pizza”. So, when building (context_words, focus_word) pairs for each word in our sentence using Continuous Bag of Words with a context window of 1, the pairs would be:

([‘I’, ‘pizza’], ‘eat’), ([‘eat’, ‘every’], ‘pizza’), ([‘pizza’, ‘day’], ‘every’).

(Note: You can choose whatever context window size you like.)
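A minimal sketch of how such (context_words, focus_word) pairs could be generated with a context window of 1 (the make_cbow_pairs helper below is purely illustrative, not part of PyTorch):

```python
def make_cbow_pairs(tokens, window=1):
    """Build (context_words, focus_word) pairs for CBOW."""
    pairs = []
    for i, focus in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if len(context) == 2 * window:   # keep only full windows, as in the example above
            pairs.append((context, focus))
    return pairs

tokens = ["I", "eat", "pizza", "every", "day"]
print(make_cbow_pairs(tokens, window=1))
# [(['I', 'pizza'], 'eat'), (['eat', 'every'], 'pizza'), (['pizza', 'day'], 'every')]
```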

Skip-gram – The skip-gram technique is similar to continuous bag of words, but the main difference is that instead of predicting the target word from the context, it takes the target word as input and tries to predict the context (a set of surrounding words). Example – If we take the above sentence, the input will be “eat” and the target will be [“I”, “pizza”].

[Figure: Skip-gram – the focus word W(t) is used to predict the context words W(t-2), W(t-1), W(t+1), W(t+2)]

Here, the context words for each focus word are predicted. As seen in the diagram above, for each input word, the neighbouring context words are predicted. To get multiple outputs from a single input, a softmax activation is used to assign probabilities to the context words. Using the same context window of 1, the (focus word, context words) pairs look like:

(‘eat’, [‘I’, ‘pizza’]), (‘pizza’, [‘eat’, ‘every’]), (‘every’, [‘pizza’, ‘day’])
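Along the same lines, a sketch of generating the (focus word, context words) pairs with a window of 1 (again, the make_skipgram_pairs helper is purely illustrative):

```python
def make_skipgram_pairs(tokens, window=1):
    """Build (focus_word, context_words) pairs for skip-gram."""
    pairs = []
    for i, focus in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if len(context) == 2 * window:   # keep only full windows, as in the example above
            pairs.append((focus, context))
    return pairs

tokens = ["I", "eat", "pizza", "every", "day"]
print(make_skipgram_pairs(tokens, window=1))
# [('eat', ['I', 'pizza']), ('pizza', ['eat', 'every']), ('every', ['pizza', 'day'])]
```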

Word Embedding in Pytorch

Word embedding is a powerful concept that helps in solving many natural language processing problems. Since a machine doesn’t understand raw text, we need to transform it into numerical data before performing any operations. The most basic approach is to assign each word/letter a unique vector, but this is not very useful because words with similar meanings end up with completely different vectors. A more useful approach is to train a model that generates the word vectors. This is better than the previous approach because it groups similar words together and produces similar vectors for them, and it captures the overall meaning/context of words and sentences, which is better than a random assignment of vectors.
