Attention Mechanism in Longformers

The Longformer paper proposed four types of self-attention, as shown in the diagram below:

Longformer self-attention mechanism

Each matrix in the diagram shows a different way of calculating attention. The rows and columns correspond to the words in the sentence, so each matrix has size N × N, where N is the length of the input sentence. Green squares indicate pairs of words for which attention is calculated, and white squares indicate pairs for which it is not.

Full Attention

This is the basic full attention mechanism of the Transformer architecture as discussed above. Here all the squares are green, indicating that the attention for each word is computed with respect to every other word. The number of operations therefore scales as N² (quadratic). For our example, full attention can be visually represented as

Full n^2 Attention
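
To make the quadratic cost concrete, here is a minimal NumPy sketch (illustrative only, not the actual Longformer code) that builds the full N × N attention mask for a short example sentence and counts how many query–key pairs have to be computed; the token list and variable names are just for demonstration.

```python
# Minimal sketch: a full attention mask, where every token attends to every token.
import numpy as np

tokens = ["I", "love", "Geeks", "for", "Geeks"]    # short example sentence
N = len(tokens)

full_mask = np.ones((N, N), dtype=int)             # all squares "green"

print(full_mask)
print("attention computations:", full_mask.sum())  # N * N = 25, i.e. O(N^2)
```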

Sliding Window

In sliding window attention we do not compute the attention of each word with respect to all other words. Instead, we take a window of length w and calculate attention only within that window. For example, suppose the window length is 2. Then, for the sentence ‘I love geeks for geeks articles’, the attention vector for the third word ‘geeks’ is calculated considering only the words ‘love’ and ‘for’. The computational complexity of this attention pattern is O(N × w), which scales linearly with the input sequence length N. A sliding window can be visually represented as

Sliding Window Attention
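
As a rough illustration (not the optimized implementation used in the Longformer paper), the sketch below builds a sliding window mask for the example sentence with a window of length 2, i.e. one neighbour on each side plus the token itself, and counts the attended pairs.

```python
# Minimal sketch: a sliding-window attention mask of width w.
import numpy as np

tokens = ["I", "love", "geeks", "for", "geeks", "articles"]
N, w = len(tokens), 2                       # w = 2 -> one neighbour on each side

window_mask = np.zeros((N, N), dtype=int)
for i in range(N):
    lo, hi = max(0, i - w // 2), min(N, i + w // 2 + 1)
    window_mask[i, lo:hi] = 1               # attend only inside the local window (self included)

print(window_mask)
print("attention computations:", window_mask.sum())  # about N * (w + 1), i.e. O(N * w)
```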

Dilated Sliding Window Attention

Also known as a sparse window, dilated sliding window attention creates gaps in the connection pattern by skipping certain tokens. It is similar to the sliding window, but here we attend to words separated by a dilation of d within a window of w (indicated by the white squares in between the green squares). This increases the receptive field without increasing the memory requirement.

Through the use of dilated sliding windows, Transformers can prioritize capturing dependencies and relationships within a limited local context, mitigating the computational complexity typically associated with attending to all positions in the input sequence.

Since the computations are limited to the window, the computational complexity scales linearly with respect to the input length.

Dilated Sliding Window Attention
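
The dilated pattern can be sketched the same way (again an illustrative toy, not the custom kernel used in the paper): here each token attends to positions that are multiples of the dilation d away, up to a window of w positions on each side, so the receptive field reaches further for roughly the same number of computations.

```python
# Minimal sketch: a dilated sliding-window mask with window w and dilation d.
import numpy as np

tokens = ["I", "love", "geeks", "for", "geeks", "articles"]
N, w, d = len(tokens), 4, 2                 # look up to w positions away, in steps of d

dilated_mask = np.zeros((N, N), dtype=int)
for i in range(N):
    for k in range(-(w // d), w // d + 1):  # offsets 0, ±d, ±2d, ... up to ±w
        j = i + k * d
        if 0 <= j < N:
            dilated_mask[i, j] = 1          # gaps of d - 1 white squares between green squares

print(dilated_mask)
```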

Global Sliding Window Attention

Global Sliding Window Attention is a modification of the sliding window mechanism in which certain words or tokens are allowed to attend to all other tokens when calculating their attention vectors (indicated by the green horizontal and vertical lines). The self-attention mechanism then works on both a “local” and a “global” context. In the Longformer architecture, most tokens attend “locally” to each other within a specified window size: they look at the preceding and succeeding tokens within this local context. A selected few tokens attend “globally” to all other tokens in the sequence; these global attention tokens can consider information from the entire sequence rather than being limited to a specific window size. It is important to note that in Longformer’s design, every locally attending token considers not only the tokens within its window but also all globally attending tokens. This ensures that the global attention is symmetric.

Global Sliding Window Attention strikes a middle ground between computational efficiency and modeling capacity. This makes it well suited to tasks that need to balance computational resources against the contextual information that can be captured within a restricted context window.

Since the number of global tokens is small, the extra computation they add is limited, and the overall complexity still scales linearly with the input length.

Global Attention Mechanism
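
The combined pattern can be sketched by adding full rows and columns for the chosen global positions on top of the local window mask. Which tokens get global attention is task-specific (for example, the [CLS] token in classification); the positions below are only illustrative.

```python
# Minimal sketch: local sliding-window attention plus a few global tokens.
import numpy as np

N, w = 6, 2                                 # sequence length and window size (example values)
global_positions = [0]                      # e.g. a [CLS]-style token chosen for global attention

mask = np.zeros((N, N), dtype=int)
for i in range(N):                          # local sliding-window part
    lo, hi = max(0, i - w // 2), min(N, i + w // 2 + 1)
    mask[i, lo:hi] = 1
for g in global_positions:                  # global part: a full row and a full column
    mask[g, :] = 1                          # the global token attends to every token
    mask[:, g] = 1                          # every token attends back to the global token

print(mask)                                 # symmetric global stripes over a local band
```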

The Global Sliding Window and Dilated Sliding Window attention techniques are designed to enhance the scalability and efficiency of the self-attention process within Transformer-based models. They offer alternatives to the conventional self-attention mechanism, striking a balance between computational demands and the ability to capture extensive contextual relationships within the input sequence.

The Hugging Face Longformer model supports global attention but not the dilated sliding window mechanism.
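
As a quick sketch of how this looks in practice with the Hugging Face transformers library (the checkpoint is the standard allenai/longformer-base-4096 release, and giving global attention to the first token is just one common choice):

```python
# Sketch: running the Hugging Face Longformer with a global_attention_mask.
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("I love Geeks for Geeks", return_tensors="pt")

# 0 = local (sliding window) attention, 1 = global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1             # give the first ([CLS]) token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```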

Longformer in Deep Learning

Transformer-based models are very good at understanding and processing text, but they struggle when the text is very long. To address this issue, researchers developed a model known as the “Longformer.” It is a modified Transformer designed to work well with extremely long pieces of text, which it accomplishes by changing how it attends to words.

For the purposes of this article, we will use a running example of a task. Let’s say we want to classify a review written on the Geeks for Geeks website, and the review is 1000 words long. Since it is not practical to repeat all the words of the review throughout the article, we will use a short stand-in for it so that the concepts are easier to follow. Let the review be “I love Geeks for Geeks”.
