Working of Region Proposal Network (RPN)

RPN is a fully convolutional network that predicts object bounds by learning from feature maps extracted from a base network. It has a classifier that returns the probability that a region contains an object and a regressor that returns the coordinates of the bounding boxes.

Anchor boxes

A feature map is extracted from the layers of the convolutional neural network. Every point on this feature map is considered in turn, and each such point is called an anchor.

  • Anchor boxes are boxes generated at image dimension, based on a set of predefined aspect ratios and scales. The aspect ratio is the ratio of the width to the height of the box, and the scale is the size of the box.
  • The anchor boxes are centered on each anchor point of the convolutional feature map, as shown in the image below. Each anchor box is slid across the entire feature map, much like a sliding window. With each point as the center of an anchor box, the network predicts whether or not the box contains an object.

Therefore, if we have k anchor boxes, then for every position on the feature map, we will have k predictions. These predictions are essentially binary classifications of foreground and background.
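The anchor layout described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the stride of 16 pixels between feature-map positions, and the particular scales and ratios are assumptions chosen for the example, not values taken from this article.

```python
import math

def make_anchor_shapes(scales, ratios):
    """Return one (width, height) pair per scale/ratio combination.

    ratio = width / height, and every box keeps an area of scale ** 2.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            shapes.append((w, h))
    return shapes

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Center every anchor shape on every feature-map position.

    Boxes are returned in image coordinates as (x1, y1, x2, y2).
    """
    shapes = make_anchor_shapes(scales, ratios)
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Map the feature-map cell back to a center point in the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for w, h in shapes:
                anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical 2 x 3 feature map with k = 3 scales * 3 ratios = 9 anchor boxes.
anchors = make_anchors(feat_h=2, feat_w=3, stride=16,
                       scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 2 * 3 positions * 9 shapes = 54 anchors
```

Note that boxes near the border can extend past the image; how those are handled (clipped or ignored) is a design choice this sketch leaves open.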

Intersection-Over-Union (IoU)

The foreground and background labels are assigned based on a metric called Intersection over Union (IoU), which measures how much the anchor box overlaps the object of interest. It is calculated as the ratio of the area of intersection between the anchor box and the ground-truth box of the object to the area of the union of the two boxes. An IoU > 0.7 usually implies foreground.
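As a rough illustration, the IoU computation and the labelling rule could look like the sketch below. Boxes are assumed to be (x1, y1, x2, y2) tuples. The 0.7 foreground threshold comes from the text above; the 0.3 background threshold and the "ignore" label for anchors in between are assumptions borrowed from the usual Faster R-CNN setup, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (foreground), 0 (background) or -1 (ignored in training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1  # assumption: ambiguous anchors do not contribute to training

gt = [(0, 0, 100, 100)]
print(label_anchor((0, 0, 100, 90), gt))     # IoU = 0.9   -> 1
print(label_anchor((50, 50, 150, 150), gt))  # IoU ~ 0.14  -> 0
print(label_anchor((0, 0, 100, 50), gt))     # IoU = 0.5   -> -1
```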

Note: The bounding box classifier in RPN does not tell us whether an object is an animal or a vehicle. It only tells us if the part of the image is background or foreground.

Regression

The regressor predicts four values for each anchor box, corresponding to the center coordinates, width, and height of the bounding box respectively. Since the RPN is a learned model, it has a cost function, defined as the sum of a classification loss and a regression loss. It can be written as follows:

[Tex]L = L_{cls} + L_{reg} [/Tex]

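[Tex]L(\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_{i}, p_{i}^*) + \lambda \frac{1}{N_{reg}} \sum_{i} p_{i}^* L_{reg}(t_{i}, t_{i}^*) [/Tex]

where:

  • [Tex]i[/Tex] is the index of an anchor,
  • [Tex]p_{i}[/Tex] is the predicted probability of anchor [Tex]i[/Tex] being an object,
  • [Tex]t_{i}[/Tex] is the vector of parameterized coordinates of the predicted box,
  • [Tex]^*[/Tex] marks the target (ground-truth) value, so [Tex]p_{i}^*[/Tex] is 1 for a foreground anchor and 0 for a background anchor, and [Tex]t_{i}^*[/Tex] describes the target box,
  • [Tex]N_{cls}[/Tex] and [Tex]N_{reg}[/Tex] are normalization terms,
  • [Tex]\lambda[/Tex] is a weight that balances the two losses.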
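To make the cost function more concrete, here is a minimal sketch that evaluates it for a handful of anchors, using the symbols defined above. Several details are assumptions not stated in this article: the classification term is taken to be binary log loss, the regression term is taken to be the smooth L1 loss used in Faster R-CNN, and the regression term is counted only for foreground anchors (target label 1). The function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty (assumed choice for the regression loss)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Sum of normalized classification and regression losses.

    p       predicted object probabilities, one per anchor
    p_star  target labels (1 = foreground, 0 = background)
    t       predicted box parameters, one 4-tuple per anchor
    t_star  target box parameters, one 4-tuple per anchor
    """
    eps = 1e-12  # guards against log(0)
    cls_loss = 0.0
    reg_loss = 0.0
    for p_i, ps_i, t_i, ts_i in zip(p, p_star, t, t_star):
        # Binary log loss between predicted probability and target label.
        cls_loss += -(ps_i * math.log(p_i + eps)
                      + (1 - ps_i) * math.log(1 - p_i + eps))
        # Assumption: box regression is only penalized for foreground anchors.
        reg_loss += ps_i * sum(smooth_l1(a - b) for a, b in zip(t_i, ts_i))
    return cls_loss / n_cls + lam * reg_loss / n_reg

loss = rpn_loss(
    p=[0.9, 0.2],
    p_star=[1, 0],
    t=[(0.1, 0.0, 0.2, 0.0), (1.0, 1.0, 1.0, 1.0)],
    t_star=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    n_cls=2, n_reg=2, lam=1.0,
)
print(loss)
```

In this example the second anchor is background, so its box parameters do not affect the loss even though they are far from the target.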

The offsets thus obtained are applied to the anchor boxes to get the RoIs, which are processed further in the later stages of object detection. In a nutshell, the RPN proposes a set of boxes that are classified as background or foreground, and the foreground anchor boxes are further refined to obtain the regions of interest.
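The refinement step can be sketched as follows. The article does not spell out how the offsets are parameterized, so this example assumes the common Faster R-CNN convention: the center offsets are expressed as fractions of the anchor's width and height, and the size offsets are logarithms of the size ratios. The function names and the 0.5 score threshold are hypothetical.

```python
import math

def apply_offsets(anchor, t):
    """Refine an (x1, y1, x2, y2) anchor with offsets t = (tx, ty, tw, th)."""
    wa = anchor[2] - anchor[0]
    ha = anchor[3] - anchor[1]
    xa = (anchor[0] + anchor[2]) / 2
    ya = (anchor[1] + anchor[3]) / 2
    # Assumed parameterization: shift the center, rescale width and height.
    x = xa + t[0] * wa
    y = ya + t[1] * ha
    w = wa * math.exp(t[2])
    h = ha * math.exp(t[3])
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def propose(anchors, scores, offsets, score_thresh=0.5):
    """Keep anchors scored as foreground and refine them into RoIs."""
    rois = []
    for anchor, score, t in zip(anchors, scores, offsets):
        if score > score_thresh:
            rois.append(apply_offsets(anchor, t))
    return rois

rois = propose(
    anchors=[(0, 0, 10, 10), (20, 20, 40, 40)],
    scores=[0.9, 0.1],
    offsets=[(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
)
print(rois)  # one RoI: the first anchor shifted 1 pixel to the right
```

A full pipeline would typically also clip the boxes to the image and prune overlapping proposals; those steps are left out of this sketch.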



Region Proposal Network (RPN) in Object Detection

In recent times, object detection algorithms have evolved considerably, leading to many advances in applications that solve real-world problems efficiently and with real-time latency. In this article, we will look at Region Proposal Networks, which serve as an important milestone in the advancement of object detection algorithms.

Table of Content

  • What is Object Detection?
  • Region Proposal in R-CNN family
  • Working of Region Proposal Network (RPN)
