Implementing Feature Extraction using HuggingFace Model
We are going to initialize a feature extraction pipeline using the BERT model and process the input text "Geeks for Geeks" through the pipeline to extract features.
For this implementation, we need to install the transformers library:
pip install transformers
Step 1: Import Necessary Library
Import the pipeline function from the transformers library. This function loads a pre-trained model and uses it for NLP tasks.
from transformers import pipeline
Step 2: Define BERT checkpoint
'bert-base-uncased' is a version of BERT (Bidirectional Encoder Representations from Transformers) that converts all text to lowercase, discarding casing information. Here, we specify that we want to use this pre-trained BERT model.
checkpoint = "bert-base-uncased"
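As a quick optional check (not part of the main example), the tokenizer paired with this checkpoint can be loaded to see the lowercasing in action:
from transformers import AutoTokenizer

# Load the tokenizer that pairs with the uncased checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Mixed-case input is lowercased before being split into WordPiece tokens
print(tokenizer.tokenize("Hello World"))  # ['hello', 'world']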
Step 3: Initialize Feature Extraction pipeline
Then we create a feature extraction pipeline using the BERT model. The argument framework="pt" specifies that PyTorch is being used.
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
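For reference, the pipeline roughly wraps the tokenizer and the bare BERT encoder. A minimal sketch of the equivalent manual calls (an illustration, not part of the original example) looks like this:
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare BERT encoder for the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Tokenize the text and run a forward pass without computing gradients
inputs = tokenizer("Geeks for Geeks", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, num_tokens, 768)
print(outputs.last_hidden_state.shape)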
Step 4: Feature Extraction
Now, we will input the text to extract features. After initializing the feature extraction pipeline, the text is processed through the BERT model, resulting in a PyTorch tensor containing the extracted features. To convert this tensor into a more manageable format, such as a NumPy array, the .numpy() method is applied. Then, mean(axis=0) averages the array along the first dimension, i.e. across all tokens in the input text. This results in a single 768-dimensional vector, where each value is the average of the corresponding feature extracted by BERT. This vector serves as a numerical representation of the input text's semantic content and can be used in downstream tasks such as text classification, clustering, or similarity calculations (a small sketch follows the code below).
text = "Geeks for Geeks"
features = feature_extractor(text, return_tensors="pt")[0]
reduced_features = features.numpy().mean(axis=0)
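For instance, the pooled vectors of two texts can be compared with cosine similarity. The helper function and the second sentence below are purely illustrative assumptions, not part of the original example:
import numpy as np

def embed(text):
    # Run the text through the pipeline and mean-pool over the token dimension
    output = feature_extractor(text, return_tensors="pt")[0]
    return output.numpy().mean(axis=0)

vec_a = embed("Geeks for Geeks")
vec_b = embed("A portal for computer science geeks")

# Cosine similarity between the two 768-dimensional sentence vectors
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(similarity)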
Complete Code to Extract Features using the BERT Model
from transformers import pipeline
# Define the BERT model checkpoint
checkpoint = "bert-base-uncased"
# Initialize the feature extraction pipeline
feature_extractor = pipeline("feature-extraction", framework="pt", model=checkpoint)
# Define the text
text = "Geeks for Geeks"
# Extract features
features = feature_extractor(text, return_tensors="pt")[0]
# Convert to numpy array and reduce along the first dimension
reduced_features = features.numpy().mean(axis=0)
print(reduced_features)
Output:
[ 5.02510428e-01 -2.45701224e-02 2.26838857e-01 2.30424330e-01
-1.38328627e-01 -2.84000754e-01 1.10542558e-01 4.50471163e-01
...
-1.96653694e-01 -2.78628379e-01 1.52640432e-01 4.47542313e-03
-2.00327083e-01 7.34994039e-02 2.04465240e-01 -1.33181065e-01]