Role of Sliding window in LongFormer’s Attention Mechanism

LongFormer (Long-Document Transformer) is an upgrade over earlier transformer models such as SpanBERT: it aims to overcome their limitation of accepting only short input sequences (at most 512 tokens). To do so, it adopts a CNN-like local attention pattern known as Sliding Window Attention. See Fig 3 for a better understanding.

Fig 3 : CNN based Sliding Window Attention model

The problem with standard full self-attention is that it assumes any word w could be related to any other word w′. Hence, it scores every possible pair of words that could be related, so the time complexity of the computation grows quadratically with the sequence length.

As discussed above, LongFormer's calculations are based on the assumption that the most important information related to a given word is found among its surrounding neighbors. So, each word is allowed to attend only to its left and right neighbors, w/2 tokens on each side for a window of size w.

See Fig 4 for a better understanding.

Fig 4 : Working of sliding window attention model
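
To make this concrete, here is a minimal NumPy sketch of the idea: it builds a banded mask in which each token may attend only to itself and to w/2 neighbours on each side, then applies that mask in a toy single-head attention computation. The token count, hidden size and window size are illustrative assumptions, and this is not Longformer's actual implementation, which relies on specialised banded kernels rather than masking a full score matrix.

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Boolean mask: entry (i, j) is True if token i may attend to token j.
    Each token sees window // 2 neighbours on each side, plus itself."""
    half = window // 2
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= half

def windowed_attention(Q, K, V, window):
    """Toy single-head attention restricted to a sliding window."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) dot-product scores
    mask = sliding_window_mask(Q.shape[0], window)
    scores = np.where(mask, scores, -1e9)              # hide out-of-window keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

# Illustrative example: 8 tokens, hidden size 4, window of 3 (one neighbour per side)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(8, 4))
print(windowed_attention(Q, K, V, window=3).shape)     # (8, 4)
print(sliding_window_mask(8, 3).astype(int))           # banded 0/1 matrix
```

Note that, for clarity, this sketch still scores the full n × n matrix and masks it afterwards; the real savings come from only ever computing the in-window scores.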

Unlike full self-attention, where every token is connected to every other token, the sliding window approach requires far fewer query–key mappings because only the neighboring words are taken into consideration. Hence the time complexity of the computation improves from quadratic in the sequence length to roughly linear (proportional to sequence length × window size).
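
As a back-of-the-envelope illustration of that improvement, the short sketch below counts how many query–key pairs are scored under full self-attention versus a sliding window (edge tokens, which see fewer neighbours, are ignored; the sequence lengths and the window size of 512 used here are illustrative values only).

```python
def full_attention_pairs(n):
    """Every token attends to every token: O(n^2) pairs."""
    return n * n

def sliding_window_pairs(n, w):
    """Each token attends to w neighbours plus itself: O(n * w) pairs."""
    return n * (w + 1)

window = 512                              # illustrative window size
for n in (4_096, 16_384, 65_536):         # illustrative sequence lengths
    print(f"n={n:6d}  full={full_attention_pairs(n):>13,}"
          f"  windowed={sliding_window_pairs(n, window):>13,}")
```

The windowed count grows linearly with n for a fixed window, while the full count grows quadratically.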

Example

Let us work through an example to understand how this attention model operates. Consider the image given below (Fig 5).

Fig 5 : Intuition Example  

Assume that each block in the above adjacency matrix represents one word (token). Let the rows represent the input (query) words of a sentence and the columns represent the key words that receive attention. (Here, the window size = 3.)

So, as per the sliding window attention model, each input word attends to itself as well as to its neighboring key tokens. In practice, each block typically represents 64 tokens, so a block of 64 input tokens attends to only 3 × 64 = 192 relevant key tokens instead of all key tokens (shown in Fig 6). This makes the model much more efficient than full self-attention, where every query attends to every key.

Fig 6 : Working of the SWA model with 64-token blocks
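
The hypothetical helper below sketches this block view: queries are processed in blocks of 64 tokens, and each query block attends to the previous, current and next key blocks, i.e. 3 × 64 = 192 key tokens for interior blocks (blocks at the sequence edges see fewer). The block size and three-block window come from the example above; the code is illustrative and is not Longformer's implementation.

```python
BLOCK = 64           # tokens per block, as in the example above
WINDOW_BLOCKS = 3    # each query block sees the previous, current and next key block

def key_range_for_block(block_idx, n_blocks):
    """Return the [start, end) key-token indices visible to one query block."""
    half = WINDOW_BLOCKS // 2
    first = max(0, block_idx - half)
    last = min(n_blocks - 1, block_idx + half)
    return first * BLOCK, (last + 1) * BLOCK

n_tokens = 512                      # illustrative sequence length
n_blocks = n_tokens // BLOCK
for b in range(n_blocks):
    start, end = key_range_for_block(b, n_blocks)
    print(f"query block {b}: keys [{start:3d}, {end:3d})  ->  {end - start} key tokens")
```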

Sliding Window Attention

Sliding Window Attention is a type of attention mechanism used in neural networks. The attention mechanism allows the model to focus on different parts of the input sequence when making predictions, providing a more flexible and content-aware approach.

Prerequisite: Attention Mechanism | ML

A wise man once said, “Manage your attention, not your time and you’ll get things done faster”.

In this article, we will cover the sliding window attention mechanism used in deep learning, as well as the working of the sliding window attention classifier.
