Website Summarizer

A Website Summarizer is a tool for condensing web page content. By offering succinct, cohesive summaries, it enables visitors to quickly grasp the major concepts and key information in lengthy articles, blog entries, or news reports. Website summarizers analyze the source material, identify essential sentences or phrases, and use natural language processing algorithms to produce summaries that capture the core of the content. This saves time and helps users make informed decisions, whether they are staying up to date on current events or doing research. In today's information age, website summarizers make internet content far more accessible and manageable.

Types of Summarizers

There are two broad types of summarizers:

  • Extractive Summarizers: Extractive summarizers select important sentences or phrases directly from the source text. They do not generate new sentences; they pull the most relevant portions of the original content verbatim. In practice, this approach often produces summaries that are less fluent and less precise.
  • Abstractive Summarizers: These summarizers generate summaries in a more human-like manner, producing original sentences that convey the meaning of the content. They often rephrase and rewrite sentences, so the result reads as though a person wrote it, and it tends to be more contextually accurate.

This article focuses on abstractive summarizers. We will be developing a real-time summarizer that produces meaningful, human-like summaries from a given website URL.
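To make the extractive/abstractive distinction concrete, here is a minimal sketch of a word-frequency extractive summarizer in plain Python. It is not part of the BART pipeline built later; it simply shows how an extractive approach copies the highest-scoring sentences verbatim rather than writing new ones:

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Score each sentence by the frequency of its words and keep the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Score a sentence as the sum of its word frequencies.
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    top = set(scored[:num_sentences])
    # Preserve the original sentence order in the output.
    return " ".join(s for s in sentences if s in top)

text = (
    "BART is a sequence-to-sequence model. "
    "It was created by Facebook AI. "
    "BART is trained to reconstruct corrupted text. "
    "The model works well for summarization."
)
print(extractive_summary(text, num_sentences=2))
```

Note that every sentence in the output appears unchanged in the input; an abstractive model like BART, by contrast, generates new sentences.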

What is BART?

BART (Bidirectional and Auto-Regressive Transformers) is a language model created by Facebook AI. It falls under the category of sequence-to-sequence (seq2seq) models, which means it can be trained for tasks that convert one sequence of data into another, such as machine translation, text summarization, and question answering.

What sets BART apart is its training process. It is pre-trained on a large corpus of text with a denoising objective: BART learns to reconstruct text that has been corrupted in some way, for example by shuffling its sentences or replacing spans with a mask token. This pre-training equips BART with a strong understanding of language structure and meaning, which empowers it to excel in real-world applications.
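The corruption step can be illustrated with two of the transformations mentioned above, sentence shuffling and span masking. This is a simplified, hypothetical sketch of the idea, not BART's actual pre-processing code:

```python
import random

def shuffle_sentences(sentences, rng):
    """Randomly permute sentence order (sentence permutation noise)."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def mask_span(tokens, rng, mask_token="<mask>", span_len=2):
    """Replace one random span of tokens with a single mask token (text infilling)."""
    start = rng.randrange(0, max(1, len(tokens) - span_len))
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

rng = random.Random(0)
sentences = [
    "BART is a seq2seq model.",
    "It was built by Facebook AI.",
    "It excels at summarization.",
]
print(shuffle_sentences(sentences, rng))
print(mask_span("BART learns to reconstruct corrupted text".split(), rng))
```

During pre-training, the model receives the corrupted version as input and is trained to emit the original, uncorrupted text.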

BART’s Architecture

BART operates in a sequence-to-sequence manner: it takes a series of input tokens and generates a corresponding series of output tokens. The model consists of two components, an encoder and a decoder:

  1. The encoder is a bidirectional Transformer, meaning it analyzes the input tokens in both directions. This enables the encoder to capture how different tokens in the input sequence relate to each other, even over long distances.
  2. The decoder is a left-to-right (autoregressive) Transformer that generates the output tokens sequentially, each conditioned on the tokens produced so far. This ensures that the decoder generates an output sequence that is coherent and flows naturally.

Figure: BART encoder-decoder network architecture

Both the encoder and the decoder consist of multiple layers of attention and feed-forward neural networks. The attention layers enable the model to grasp connections between tokens in the input and output sequences, while the feed-forward networks enable it to learn more complex relationships among those tokens.
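If the `transformers` library is installed, this architecture can be inspected through BART's configuration object. Constructing a `BartConfig` with its defaults (which mirror the bart-large checkpoint) downloads no weights, so this is a cheap way to see the encoder/decoder shape:

```python
# Inspecting BART's encoder-decoder shape via its configuration object.
# Requires `pip install transformers`; no model weights are downloaded.
from transformers import BartConfig

cfg = BartConfig()
print("encoder layers: ", cfg.encoder_layers)
print("decoder layers: ", cfg.decoder_layers)
print("attention heads:", cfg.encoder_attention_heads)
print("hidden size:    ", cfg.d_model)
```

Each of those layers contains the attention and feed-forward sub-blocks described above.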

BART’s Working

  1. BART is trained with a denoising autoencoder objective: it learns to reconstruct the original input sequence from a modified version of it.
  2. The modified input sequence is formed by applying random transformations to the original sequence.
  3. These transformations can take several forms, such as removing tokens, adding random tokens, masking spans, or shuffling the order of tokens.
  4. The denoising objective encourages BART to learn a representation of the input sequence that remains robust in the presence of noise.
  5. This means BART can restore the original input sequence even when it has been distorted by noise.
  6. As a result, BART is well suited to natural language processing tasks such as text summarization, machine translation, and question answering, where the input data is often noisy.
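In practice, the easiest way to use a pre-trained BART for summarization is through the Hugging Face `pipeline` API with the `facebook/bart-large-cnn` checkpoint (BART fine-tuned on the CNN/DailyMail summarization dataset). A minimal sketch; the import is kept inside the function because merely defining it should cost nothing, while calling it downloads roughly 1.6 GB of weights on first use:

```python
def summarize_text(text, max_length=130, min_length=30):
    """Summarize `text` with a BART checkpoint fine-tuned for summarization."""
    # Lazy import: requires `pip install transformers torch`.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    result = summarizer(
        text, max_length=max_length, min_length=min_length, do_sample=False
    )
    return result[0]["summary_text"]

# Example (uncomment to run; downloads model weights on first use):
# print(summarize_text("Long article text goes here..."))
```

`max_length` and `min_length` bound the generated summary in tokens, and `do_sample=False` makes generation deterministic.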

Website Summarizer using BART

In an age where information is abundant on the internet, the need for efficient content consumption has never been greater. This article walks through the development of a web-based application for summarizing website content, built on a Hugging Face Transformer model.

When you come across an article you like but it's very long, it can be hard to find the time to read the whole thing. That's where a summarizer is a real lifesaver: it gives you a short and sweet version of the article, so you can quickly get the main points without spending too much time.
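Putting the pieces together, the summarizer needs to fetch a page, strip it down to visible text, and hand that text to BART. The sketch below uses only the standard library for the extraction step (a production version might use `requests` and `BeautifulSoup` instead); `summarize_url` is a hypothetical helper name, and the transformers import is again lazy so the extraction part runs on its own:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def summarize_url(url):
    """Fetch a page and summarize its visible text with BART (hypothetical helper)."""
    from transformers import pipeline  # heavy dependency, imported lazily
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    text = page_text(html)[:4000]  # BART's input window is limited (~1024 tokens)
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    return summarizer(text, max_length=130, min_length=30, do_sample=False)[0]["summary_text"]

sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>BART summarizes web pages.</p></body></html>"
)
print(page_text(sample))
```

Truncating the extracted text before summarization is a crude workaround for BART's limited input length; a more careful implementation would chunk the text and summarize each chunk.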
