Sliding window attention classifier

In this approach, a window of size m × n pixels is slid across the input image in order to find the target object(s) in that image. The classifier is trained by presenting it with a set of positive examples (containing the target object) and negative examples (not containing the target object).
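As a rough illustration of the traversal (not code from the original article), the sketch below slides a fixed-size window across a NumPy image array; the window size and stride are arbitrary placeholder values.

```python
import numpy as np

def sliding_window(image, win_h=32, win_w=32, stride=8):
    """Yield (row, col, patch) for every window position in the image."""
    H, W = image.shape[:2]
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            yield r, c, image[r:r + win_h, c:c + win_w]

# Example: count the windows extracted from a 128x128 image.
image = np.zeros((128, 128))
print(sum(1 for _ in sliding_window(image)))   # 13 * 13 = 169 window positions
```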

Intuition

Fig 1: Face Detection using sliding window 

The training is done in such a way that we can capture all of the target object(s) present in an image. Fig 1 depicts a face detection model at work. As you can see, the image contains faces of various sizes. Another possibility is that some people are far from the camera while others are near, which also changes the apparent size of their faces.

Sliding Window Attention (Intuition Continued)

  • During Training 

The classifier is trained on two classes of samples: one containing the object of interest and the other containing random objects. Samples containing the object of interest are referred to as positive examples, and those with random objects are referred to as negative examples. This is done so that, when new images arrive during the testing phase, the classifier can accurately decide whether the object inside the window is the target object or some other random object. (A combined training-and-testing sketch follows this list.)

  • During Testing

The idea is to use the trained binary classifier, which determines whether the presented object is “positive” or “negative”. The trained classifier then scans a test image window by window, starting from the top-left corner. We also use multiple windows of various sizes to make sure that the target object is detected at every scale at which it appears in the input image.
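The sketch below illustrates both phases under stated assumptions: the positive/negative patches are random stand-ins for real training crops, scikit-learn's LogisticRegression is used as a generic binary classifier (the article does not prescribe a specific model), and the window sizes, stride, and detection threshold are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Training: two classes of flattened 32x32 patches ---
pos_patches = np.random.rand(200, 32 * 32)   # stand-ins for crops containing the target
neg_patches = np.random.rand(800, 32 * 32)   # stand-ins for random background crops

X = np.vstack([pos_patches, neg_patches])
y = np.concatenate([np.ones(len(pos_patches)), np.zeros(len(neg_patches))])
clf = LogisticRegression(max_iter=1000).fit(X, y)

def score_patch(patch):
    """Probability that a 32x32 patch contains the target object."""
    return clf.predict_proba(patch.reshape(1, -1))[0, 1]

# --- Testing: scan with several window sizes so all object scales are covered ---
def detect_multiscale(image, window_sizes=(32, 64, 96), stride=8, threshold=0.5):
    detections = []
    H, W = image.shape
    for win in window_sizes:
        step = win // 32                           # integer downsampling factor
        for r in range(0, H - win + 1, stride):
            for c in range(0, W - win + 1, stride):
                # Downsample the crop to the classifier's 32x32 input size.
                crop = image[r:r + win:step, c:c + win:step][:32, :32]
                if score_patch(crop) > threshold:
                    detections.append((r, c, win, win))
    return detections

print(len(detect_multiscale(np.random.rand(128, 128))))
```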

Just like in face detection, the sliding window model is also used to efficiently cover long text inputs (a topic covered in depth below).

Mathematics

Fig 2

In this pattern, we have a fixed window of size w. Each token attends to w/2 tokens on each side (as shown in Fig. 2). The time complexity of this pattern is therefore O(n × w), where n is the input sequence length, compared with O(n²) for full self-attention.
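A minimal NumPy sketch of this pattern, assuming illustrative values for the sequence length, head dimension, and window size (none of these come from the article): each query position i attends only to key positions j with |i − j| ≤ w/2.

```python
import numpy as np

n, d, w = 16, 8, 4                       # sequence length, head dim, window size (illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Band mask: position i may attend to positions j with |i - j| <= w // 2.
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2

scores = Q @ K.T / np.sqrt(d)            # full score matrix shown for clarity; a real
scores[~mask] = -np.inf                  # implementation computes only the O(n * w) band
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                     # (n, d) contextualised token representations
```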

Thus, this attention pattern employs a fixed-size attention window around each token. Multiple layers of such windowed attention are stacked so as to build up a large receptive field, in which the top layers have access to all input locations. This gives the model the ability to cover the entire input sequence fed to it, very much like stacking convolutional layers in a CNN. With l such layers stacked, the receptive field grows to roughly l × w tokens.
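A short sketch, under the same illustrative assumptions as above, of how the receptive field widens as windowed-attention layers are stacked: each additional layer lets information hop roughly one more window width, so after l layers a token can be influenced by about l × w positions.

```python
import numpy as np

n, w = 64, 8                              # sequence length and window size (illustrative)
idx = np.arange(n)
band = np.abs(idx[:, None] - idx[None, :]) <= w // 2   # one windowed-attention layer

reach = np.eye(n, dtype=bool)             # which positions each token can currently "see"
for layer in range(1, 6):
    # Stacking another windowed layer composes the reachability with the band mask.
    reach = (reach.astype(int) @ band.astype(int)) > 0
    span = reach[n // 2].sum()             # receptive field of the middle token
    print(f"after {layer} layer(s): {span} positions")   # grows by roughly w per layer
```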

Sliding Window Attention

Sliding Window Attention is a type of attention mechanism used in neural networks. The attention mechanism allows the model to focus on different parts of the input sequence when making predictions, providing a more flexible and content-aware approach.

Prerequisite: Attention Mechanism | ML

A wise man once said, “Manage your attention, not your time and you’ll get things done faster”.

In this article, we cover the sliding window attention mechanism used in deep learning, as well as the working of the sliding window classifier.

