GloVe
Developed at Stanford, GloVe stands for Global Vectors for Word Representation. It is a popular word embedding model built on the idea of deriving relationships between words from corpus statistics. It is a count-based model that relies on a co-occurrence matrix, which records how often each pair of words occurs together across the whole corpus; each entry is the count for one word pair.
GloVe produces a vector space in which the distance between words reflects their semantic similarity. It combines the strengths of global matrix factorisation and the local context window technique. The model is trained on global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit linear substructure in the vector space.
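For reference, GloVe learns these vectors by minimising a weighted least-squares objective over the non-zero entries of the co-occurrence matrix X (as given in the original Pennington, Socher and Manning paper):

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that caps the influence of very frequent co-occurrences.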
GloVe calculates the co-occurrence probabilities for each word pair. It divides the co-occurrence counts by the total number of co-occurrences for each word:
For example, suppose that in a small corpus "cat" co-occurs once with "chases" and once with "mouse". The co-occurrence probability of "mouse" given "cat" is then calculated as: Co-occurrence Probability("cat", "mouse") = Count("cat" and "mouse") / Total Co-occurrences("cat")
In this case:
Count("cat" and "mouse") = 1
Total Co-occurrences("cat") = 2 (with "chases" and "mouse")
So, Co-occurrence Probability("cat", "mouse") = 1 / 2 = 0.5
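The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch, assuming the toy corpus "cat chases mouse" and a symmetric window of two words; the corpus and the helper cooccurrence_counts are illustrative, not part of GloVe itself.
Python3
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[word][tokens[j]] += 1
    return counts

# toy corpus (an assumption for illustration; it reproduces the counts above)
tokens = "cat chases mouse".split()
counts = cooccurrence_counts(tokens, window=2)

total_cat = sum(counts["cat"].values())            # 2 (once with "chases", once with "mouse")
p_cat_mouse = counts["cat"]["mouse"] / total_cat   # 1 / 2
print(p_cat_mouse)                                 # 0.5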
GloVe Model Building
Firstly, download the GloVe 6B embeddings from this site. Then unzip the file and place it in the same folder as your code. There are several variants of the 6B model, but we'll be using glove.6B.50d.
Python3
# importing libraries
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# vocabulary of words to embed
x = {'processing', 'the', 'world', 'prime', 'natural', 'language'}

# build the word -> index dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# this will print the dictionary of the words mapped to their indexes
print("Dictionary is = ", tokenizer.word_index)


def embedding_vocab(filepath, word_index, embedding_dim):
    """Build an embedding matrix with one row per word in word_index,
    filled with the pre-trained GloVe vectors read from filepath."""
    vocab_size = len(word_index) + 1  # index 0 is reserved for padding
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab


# build the embedding matrix from the 50-dimensional GloVe file
embedding_dim = 50
embedding_matrix_vocab = embedding_vocab(
    'glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

# row 1 holds the vector for the first word in the tokenizer dictionary
print("Dense vector for first word is => ", embedding_matrix_vocab[1])
Output:
Dictionary is =
{'the': 1, 'world': 2, 'processing': 3, 'prime': 4, 'language': 5, 'natural': 6}
Dense vector for first word is => [4.18000013e-01 2.49679998e-01 -4.12420005e-01 1.21699996e-01
3.45270008e-01 -4.44569997e-02 -4.96879995e-01 -1.78619996e-01
-6.60229998e-04 -6.56599998e-01 2.78430015e-01 -1.47670001e-01
-5.56770027e-01 1.46579996e-01 -9.50950012e-03 1.16579998e-02
1.02040000e-01 -1.27920002e-01 -8.44299972e-01 -1.21809997e-01
-1.68009996e-02 -3.32789987e-01 -1.55200005e-01 -2.31309995e-01
-1.91809997e-01 -1.88230002e+00 -7.67459989e-01 9.90509987e-02
-4.21249986e-01 -1.95260003e-01 4.00710011e+00 1.85939997e-01
-5.22870004e-01 -3.16810012e-01 5.92130003e-04 7.44489999e-03
1.77780002e-01 -1.58969998e-01 1.20409997e-02 -5.42230010e-02
-2.98709989e-01 -1.57490000e-01 -3.47579986e-01 -4.56370004e-02
-4.42510009e-01 1.87849998e-01 2.78489990e-03 -1.84110001e-01
-1.15139998e-01 -7.85809994e-01]
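Once the matrix is built, the usual next step is to plug it into a model. The snippet below is a minimal sketch that assumes embedding_matrix_vocab and the tokenizer from the code above; it loads the pre-trained vectors into a Keras Embedding layer and freezes them.
Python3
import tensorflow as tf

vocab_size, embedding_dim = embedding_matrix_vocab.shape

# Embedding layer initialised with the pre-trained GloVe vectors;
# trainable=False keeps them frozen while the rest of the model trains.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix_vocab),
    trainable=False,
)

# look up the vector for index 1 ('the' in the dictionary printed above)
print(embedding_layer(tf.constant([1])))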