Quadratic Scaling in Self Attention

Let's calculate the number of operations needed to compute the output of self-attention.

  1. To calculate the attention vector for a single word, we compute the dot product Q·K, apply a softmax, and multiply the resulting weights with the Value vector. For the single word “I” this has to be done with every other word of the sentence (“I love Geeks for Geeks”). Thus a single word requires N operations.
  2. These N operations have to be repeated for every word in the sentence to obtain each word's attention vector. Since there are N words, the whole process is done N times.

Hence we need N × N = N² operations for each sentence of length N, which is quadratic in scale.
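To make the N² count concrete, here is a minimal NumPy sketch of full self-attention for the running example; the embedding size d and the random matrices are illustrative assumptions, not values from the article. The score matrix has shape N × N, so both compute and memory grow quadratically with sentence length.

```python
import numpy as np

# Running example: "I love Geeks for Geeks" -> N = 5 tokens
# d is an illustrative embedding size (not from the article)
N, d = 5, 8
rng = np.random.default_rng(0)

# Toy query, key and value matrices, one row per token
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Every token attends to every other token:
# the score matrix has N * N entries -> quadratic cost
scores = Q @ K.T / np.sqrt(d)                         # shape (N, N)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
output = weights @ V                                  # shape (N, d)

print(scores.shape)   # (5, 5) -> N**2 = 25 pairwise score computations
print(output.shape)   # (5, 8)
```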

The quadratic scaling of the Transformer's operations with respect to input size makes it inefficient for processing long sentences and documents. The attention computation grows very large and consumes a large amount of memory, which poses a challenge for long inputs. The standard BERT model can process at most 512 tokens, so any document longer than that has to be either truncated or chunked, which leads to information loss or cascading errors respectively.
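As a quick illustration of the 512-token limit, the sketch below tokenizes a long review with a BERT tokenizer and truncates it. It assumes the Hugging Face transformers package is installed and that the bert-base-uncased files can be downloaded; the repeated review text is just a stand-in for a real 1,000-word review.

```python
from transformers import AutoTokenizer

# Assumes `transformers` is installed and the model files can be downloaded
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A stand-in for a 1000-word review (our running example repeated)
long_review = "I love Geeks for Geeks. " * 200

# Anything past 512 tokens is simply cut off -> information loss
encoded = tokenizer(long_review, truncation=True, max_length=512)
print(len(encoded["input_ids"]))    # 512, no matter how long the review is
print(tokenizer.model_max_length)   # 512 for bert-base-uncased
```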

This is where the Longformer comes in. Its attention scales linearly with the input size, allowing it to handle sequences of up to 4,096 tokens compared with BERT's 512.
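Longformer gets this linear scaling mainly from a sliding-window (local) attention pattern, where each token attends only to a fixed number of neighbours, so the cost grows as N × w instead of N². The sketch below is a simplified, loop-based illustration of that idea, not the actual Longformer implementation; the window size w = 256 and the matrix sizes are arbitrary choices for the demonstration.

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    """Each token attends only to the w tokens on either side of it,
    so the number of score computations is O(N * w) instead of O(N**2)."""
    N, d = Q.shape
    output = np.zeros_like(V)
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)       # at most 2*w + 1 scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        output[i] = weights @ V[lo:hi]
    return output

# Illustrative sizes: a 4096-token document with a window of 256
N, d, w = 4096, 8, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out = sliding_window_attention(Q, K, V, w)
print(out.shape)                      # (4096, 8)
print(N * (2 * w + 1), "vs", N * N)   # ~2.1M local scores vs ~16.8M full scores
```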

Longformer in Deep Learning

Transformer-based models are really good at understanding and processing text, but they struggle when the text is very long. To address this issue, researchers developed a model known as the “Longformer”. It is a modified Transformer designed to work well with very long pieces of text, which it achieves by changing how each word attends to the other words.

For the purposes of this article, we will use a running example. Let's say we want to classify a review written on the Geeks for Geeks website, and the review is 1,000 words long. Since it is not practical to reproduce all the words of the review everywhere in the article, we will use a short stand-in for it so that the concepts are easy to follow. Let the review be “I love Geeks for Geeks”.
