Summarization using Hugging Face Transformers
The summarization task offered by Hugging Face Transformers involves creating a brief, coherent summary of a longer piece of text or document. This task falls under the wider scope of natural language processing (NLP) and is especially valuable for condensing information, improving readability, or extracting key insights from long articles, documents, or web pages.
Hugging Face Transformer
Hugging Face Transformers, commonly known simply as "Transformers," is a freely available, open-source library for deep learning and natural language processing (NLP). It offers a diverse array of pre-trained transformer-based models, such as BERT, GPT, and RoBERTa. These models cover various NLP tasks, including text classification, text generation, named entity recognition, and machine translation.
Tools Required
- A text editor (VS Code is recommended)
- Latest version of Python
- A web browser (Google Chrome is recommended)
- Internet Connection (For installing packages)
Building Website Summarizer
Creating Workspace
- Create a new folder on your PC and open it in your editor (VS Code).
- Create a new file named "app.py" in this newly created folder. (This is the main file where all the code will be written.)
Required Packages
transformers: This package uses Natural Language Processing under the hood and summarizes the input text using Transformer architecture.
pip install transformers
tensorflow: This package is needed for transformers to work.
pip install tensorflow
requests: This package is used to make a “GET” request on the given website URL for extracting text.
pip install requests
bs4: This package is used for scraping the content in a given website for summarizing.
lxml: This is used for processing XML and HTML documents.
pip install bs4
pip install lxml
streamlit: This package is used for designing a GUI (Graphical User Interface) thus making an interactive fully functional application.
pip install streamlit
Install all the above packages using pip in the order mentioned (use a virtual environment if you run into any installation issues).
Importing Packages
Python3
from transformers import pipeline
import requests
from bs4 import BeautifulSoup
import re
import streamlit as st
Analysis:
- Besides the installed packages, one built-in module is also imported.
- re (Regular Expressions) is used to remove parts of the scraped text that are not useful for the summary.
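The blank-line cleanup that extractText performs later relies on re.sub; here is a minimal standalone example with a hypothetical input string:

```python
import re

raw = "Heading\n\n\n  \nParagraph one.\n\nParagraph two."
# Collapse runs of blank (or whitespace-only) lines into a single newline.
cleaned = re.sub(r'\n\s*\n', '\n', raw)
print(cleaned)
```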
Extracting Text
Python3
def extractText(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        excludeList = ['disclaimer', 'cookie', 'privacy policy']
        includeList = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p'])
        elements = [element for element in includeList
                    if not any(keyword in element.get_text().lower()
                               for keyword in excludeList)]
        text = " ".join([element.get_text() for element in elements])
        text = re.sub(r'\n\s*\n', '\n', text)
        return text
    else:
        return "Error in response"
Code Analysis:
- A function extractText is created which returns the text from a given website URL.
- Using the requests library, a GET request is made which returns the body of the website.
- Then by using the Beautiful Soup library, all the headings and paragraphs from the website are extracted and joined into a single text which is then returned by the function.
- Empty lines are removed from the text using the re module.
- Blocks containing keywords such as "disclaimer" and "cookie" are removed from the extracted text. You can modify this list as per your needs.
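The keyword-based exclusion can be illustrated without a live website. The snippet below applies the same any()-based filter to plain strings, which stand in for the text of the tags Beautiful Soup would return (the sample texts are hypothetical):

```python
# Hypothetical stand-ins for the text of scraped <h*>/<p> elements.
scraped = [
    "Introduction to NLP",
    "This site uses cookie tracking for analytics.",
    "Transformers enable powerful summarization.",
    "Disclaimer: content is for education only.",
]

excludeList = ['disclaimer', 'cookie', 'privacy policy']

# Keep only elements whose text contains none of the excluded keywords.
kept = [t for t in scraped
        if not any(keyword in t.lower() for keyword in excludeList)]

print(kept)
```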
Splitting Text Into Chunks
Python3
def splitTextIntoChunks(text, chunk_size=1024):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks
Code Analysis:
- A function splitTextIntoChunks is created which returns the chunks of text from a given long text.
- It loops over the text in steps of chunk_size (defaulting to 1024) characters, appending each slice to the chunks list, which is then returned by the function.
Note: The model we are using, "facebook/bart-large-cnn", accepts a maximum of 1024 tokens, so 1024 is specified as chunk_size. Keep in mind the code splits by characters, which is only a rough proxy for tokens; adjust the chunk_size parameter according to your model's needs.
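The character-based chunking can be verified on a small string with a small chunk_size (the values below are illustrative, not the 1024 used in the app):

```python
def splitTextIntoChunks(text, chunk_size=1024):
    # Slice the text into consecutive pieces of at most chunk_size characters.
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

sample = "abcdefghij"  # 10 characters
print(splitTextIntoChunks(sample, chunk_size=4))
# → ['abcd', 'efgh', 'ij']
```

Note that the final chunk may be shorter than chunk_size, which is why summarize below clamps the summary length for short chunks.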
Summarizing Text
Python3
def summarize(text, chunk_size=1024, chunk_summary_size=128):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    chunks = splitTextIntoChunks(text, chunk_size)
    summaries = []
    for chunk in chunks:
        size = chunk_summary_size
        if len(chunk) < chunk_summary_size:
            # Integer division keeps max_length an int, as the pipeline expects
            size = len(chunk) // 2
        summary = summarizer(chunk, min_length=1, max_length=size)[0]["summary_text"]
        summaries.append(summary)
    concatenated_summary = ""
    for summary in summaries:
        concatenated_summary += summary + " "
    return concatenated_summary
Code Analysis:
- A function summarize is created which returns the summary text from the input text.
- Using the transformers pipeline function, the task is specified as "summarization" and the model as "facebook/bart-large-cnn", which is efficient and powerful for summarization tasks.
- The input text is divided into chunks and each chunk is summarized using the summarizer function.
- Here, chunk_summary_size defaults to 128. This caps the max_length of the summary generated for each chunk; you can adjust it accordingly.
- The individual summaries are then concatenated and returned by the function.
Note: The first time this function runs, the "facebook/bart-large-cnn" model (about 1.63 GB) is downloaded to your machine.
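The per-chunk length clamp inside summarize can be tested in isolation. The helper below is a sketch extracted from that logic (it is not part of the original code); it uses integer division so the resulting max_length stays an int:

```python
def clampSummarySize(chunk, chunk_summary_size=128):
    # If the chunk is shorter than the requested summary length,
    # cap the summary at half the chunk length instead.
    if len(chunk) < chunk_summary_size:
        return len(chunk) // 2
    return chunk_summary_size

print(clampSummarySize("x" * 1024))  # long chunk → 128
print(clampSummarySize("x" * 50))    # short chunk → 25
```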
Summary Generations
Python3
text = extractText(url)
summarize(text)
Output:
'Natural Language Processing is a subset of artificial intelligence. It enables machines to comprehend and analyze human languages.
In NLP we need to perform some extra processing steps. NLP software mainly works at the sentence level and it also expects words to be separate.
We will see some of the ways of collecting data if it is not available in our local machine or database. In NLP this process of feature engineering is
known as Text Representation or Text Vectorization. In the traditional approach, we create a vocabulary of unique words assign a unique id
(integer value) for each word. Bag of n-gram tries to solve this problem by breaking text into chunks of n continuous words.
N-gram representations are in the form of a sparse matrix, where each row represents a sentence and each column represents an n-gram in the vocabulary.
TF-IDF tries to quantify the importance of a given word relative to the other word in the corpus.
The value in the vector represents the measurements of some features or quality of the word. This is not interpretable for humans but Just for
representation purposes. We can understand this with the help of the below table. Heuristic-based approach is also used for the data-gathering
tasks for ML/DL model. Regular expressions are largely used in this type of model. Recurrent neural networks are a class of artificial neural networks.
The basic concept of RNNs is that they analyze input sequences one element at a time while maintaining track in a hidden state that contains a summary
of the sequence’s previous elements. This enables the RNN to process data from sources like natural languages, where context is crucial.
Long Short-Term Memory (LSTM) is an advanced form of RNN model. LSTMs function by selectively passing or retaining information from one-time
step to the next. Gated Recurrent Unit (GRU) is also the advanced form of RNN. GRUs also have gating mechanisms that allow them to selectively
update or forget information from the previous time steps. '
Creating GUI
Python3
st.title("Website Summarizer")
url = st.text_input("Enter the website URL")

if st.button("Summarize"):
    if url:
        try:
            info_text = st.empty()
            info_text.info("Extracting text from the website...")
            article = extractText(url)
            info_text.info("Summarizing the text...")
            summarized = summarize(article)
            info_text.info("Adding final touches...")
            finalSummary = summarize(summarized)
            info_text.empty()
            st.header("Summarized Text")
            st.write(finalSummary)
        except Exception as e:
            st.error("An error occurred. Please check the URL or try again later.")
    else:
        st.warning("Please enter a valid website URL.")
Code Analysis:
- Using streamlit, a title and text box are specified.
- A button named "Summarize" is created; when it is clicked, the text is first extracted from the website and then summarized.
- Then, the summarize function is called again on the summary generated to make the final text condensed and meaningful.
- Info messages are also displayed to the user to make the application interactive.
- Finally, the summary is shown below the text box and info messages are hidden on the application.
Complete Code Implementation
Python3
from transformers import pipeline
import requests
from bs4 import BeautifulSoup
import re
import streamlit as st


def extractText(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        excludeList = ['disclaimer', 'cookie', 'privacy policy']
        includeList = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p'])
        elements = [element for element in includeList
                    if not any(keyword in element.get_text().lower()
                               for keyword in excludeList)]
        text = " ".join([element.get_text() for element in elements])
        text = re.sub(r'\n\s*\n', '\n', text)
        return text
    else:
        return "Error in response"


def splitTextIntoChunks(text, chunk_size=1024):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks


def summarize(text, chunk_size=1024, chunk_summary_size=128):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    chunks = splitTextIntoChunks(text, chunk_size)
    summaries = []
    for chunk in chunks:
        size = chunk_summary_size
        if len(chunk) < chunk_summary_size:
            # Integer division keeps max_length an int, as the pipeline expects
            size = len(chunk) // 2
        summary = summarizer(chunk, min_length=1, max_length=size)[0]["summary_text"]
        summaries.append(summary)
    concatenated_summary = ""
    for summary in summaries:
        concatenated_summary += summary + " "
    return concatenated_summary


st.title("Website Summarizer")
url = st.text_input("Enter the website URL")

if st.button("Summarize"):
    if url:
        try:
            info_text = st.empty()
            info_text.info("Extracting text from the website...")
            article = extractText(url)
            info_text.info("Summarizing the text...")
            summarized = summarize(article)
            info_text.info("Adding final touches...")
            finalSummary = summarize(summarized)
            info_text.empty()
            st.header("Summarized Text")
            st.write(finalSummary)
        except Exception as e:
            st.error("An error occurred. Please check the URL or try again later.")
    else:
        st.warning("Please enter a valid website URL.")
The final application can be built and run using the below command in the terminal.
streamlit run app.py
After running the command, Streamlit prints a Local URL, where the application can be accessed on your own machine, and a Network URL, where it can be accessed from other devices on the same local network. Copy and paste either URL into your browser to open the application.
Output:
Video Output: Website Summarizer using BART
In an age where information is abundant on the internet, efficient content consumption matters more than ever. This article walked through the development of a web-based application for summarizing website content, powered by a Hugging Face Transformer model. When an article is too long to read in full, a summarizer gives you a short version so you can quickly get the main points without spending too much time.