Dive into Deep Learning
Table Of Contents

Multiscale Object Detection

In the “Anchor Box” section, we generated multiple anchor boxes centered on each pixel of the input image. These anchor boxes are used to sample different regions of the input image. However, if anchor boxes are generated centered on each pixel of the image, soon there will be too many anchor boxes for us to compute. For example, we assume that the input image has a height and a width of 561 and 728 pixels respectively. If five different shapes of anchor boxes are generated centered on each pixel, over two million anchor boxes (\(561 \times 728 \times 5\)) need to be predicted and labeled on the image.
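The count above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the text; the variable names are chosen here for illustration.

```python
# Anchor boxes needed if every pixel of the image is used as a center.
height, width = 561, 728   # image size used in the text
shapes_per_pixel = 5       # anchor box shapes generated at each pixel

total_anchors = height * width * shapes_per_pixel
print(total_anchors)  # 2042040
```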

It is not difficult to reduce the number of anchor boxes. An easy way is to uniformly sample a small portion of the pixels of the input image and generate anchor boxes centered on the sampled pixels. In addition, we can generate anchor boxes of varied numbers and sizes on multiple scales. Notice that a smaller object has more possible positions on an image than a larger one. Here is a simple example: objects with shapes of \(1 \times 1\), \(1 \times 2\), and \(2 \times 2\) have 4, 2, and 1 possible position(s) respectively on an image with the shape \(2 \times 2\). Therefore, when using smaller anchor boxes to detect smaller objects, we can sample more regions; when using larger anchor boxes to detect larger objects, we can sample fewer regions.
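The position counts in this example can be reproduced with a small helper. The function `count_positions` is hypothetical and written only for this illustration; it assumes objects are axis-aligned and placed on whole-pixel positions.

```python
def count_positions(img_h, img_w, obj_h, obj_w):
    """Count the placements of an obj_h x obj_w object inside an img_h x img_w image."""
    if obj_h > img_h or obj_w > img_w:
        return 0  # the object does not fit at all
    return (img_h - obj_h + 1) * (img_w - obj_w + 1)

# Objects of shape 1x1, 1x2, and 2x2 on a 2x2 image.
counts = [count_positions(2, 2, oh, ow) for oh, ow in [(1, 1), (1, 2), (2, 2)]]
print(counts)  # [4, 2, 1]
```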

To demonstrate how to generate anchor boxes on multiple scales, let us read an image first. Its height and width are 561 and 728 pixels respectively.

In [1]:
%matplotlib inline
import gluonbook as gb
from mxnet import contrib, image, nd

img = image.imread('../img/catdog.jpg')
h, w = img.shape[0:2]
h, w
Out[1]:
(561, 728)

In the “Two-Dimensional Convolutional Layer” section, the 2D array output of the convolutional neural network (CNN) is called a feature map. We can determine the midpoints of anchor boxes uniformly sampled on any image by defining the shape of the feature map.

The function display_anchors is defined below. It generates anchor boxes (anchors) centered on each unit (pixel) of the feature map fmap. Because the \(x\) and \(y\) coordinates in anchors have been divided by the width and height of fmap, they are values between 0 and 1 that represent relative positions in the feature map. Because the midpoints of the anchors coincide with the units of fmap, their relative spatial positions on any image are uniformly distributed. Specifically, when the width and height of the feature map are set to fmap_w and fmap_h respectively, the function uniformly samples fmap_h rows and fmap_w columns of pixels and uses them as midpoints to generate anchor boxes with size s (we assume that the list s has length 1) and different aspect ratios (ratios).

In [2]:
gb.set_figsize()

def display_anchors(fmap_w, fmap_h, s):
    # Feature map layout is (batch, channels, height, width). The values of
    # the first two dimensions will not affect the output.
    fmap = nd.zeros((1, 10, fmap_h, fmap_w))
    anchors = contrib.nd.MultiBoxPrior(fmap, sizes=s, ratios=[1, 2, 0.5])
    # Scale the relative coordinates (0 to 1) to pixel coordinates.
    bbox_scale = nd.array((w, h, w, h))
    gb.show_bboxes(gb.plt.imshow(img.asnumpy()).axes, anchors[0] * bbox_scale)
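
To make the uniform sampling concrete, the following sketch computes only the midpoints, without MXNet. It assumes that each midpoint sits at the center of its feature map unit, that is, at relative position (index + 0.5) / size. The helper `anchor_midpoints` is hypothetical and not part of gluonbook or MXNet.

```python
def anchor_midpoints(fmap_w, fmap_h, img_w, img_h):
    """Return pixel (x, y) midpoints, one per feature map unit, row by row."""
    points = []
    for i in range(fmap_h):        # rows of the feature map
        for j in range(fmap_w):    # columns of the feature map
            x_rel = (j + 0.5) / fmap_w  # relative position between 0 and 1
            y_rel = (i + 0.5) / fmap_h
            points.append((x_rel * img_w, y_rel * img_h))
    return points

# A 4 x 4 feature map over the 728 x 561 (width x height) image from the text.
mids = anchor_midpoints(fmap_w=4, fmap_h=4, img_w=728, img_h=561)
print(len(mids), mids[0], mids[-1])  # 16 (91.0, 70.125) (637.0, 490.875)
```

The midpoints depend only on the shape of the feature map and the image size, so the same grid can be laid over an image of any resolution.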

We will first focus on the detection of small objects. In order to make it easier to distinguish upon display, the anchor boxes with different midpoints here do not overlap. We assume that the size of the anchor boxes is 0.15 and the height and width of the feature map are 4. We can see that the midpoints of anchor boxes from the 4 rows and 4 columns on the image are uniformly distributed.

In [3]:
display_anchors(fmap_w=4, fmap_h=4, s=[0.15])
[Figure: anchor boxes of size 0.15 drawn on the image, centered on a 4 × 4 grid of midpoints]

We are going to reduce the height and width of the feature map by half and use a larger anchor box to detect larger objects. When the size is set to 0.4, overlaps will occur between regions of some anchor boxes.

In [4]:
display_anchors(fmap_w=2, fmap_h=2, s=[0.4])
[Figure: anchor boxes of size 0.4 drawn on the image, centered on a 2 × 2 grid of midpoints]

Finally, we are going to reduce the height and width of the feature map by half again and increase the anchor box size to 0.8. Now the midpoint of the anchor boxes is the center of the image.

In [5]:
display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
[Figure: anchor boxes of size 0.8 drawn on the image, centered on the single midpoint at the image center]

Since we have generated anchor boxes of different sizes on multiple scales, we will use them to detect objects of various sizes at different scales. Now we are going to introduce a method based on convolutional neural networks (CNNs).

At a certain scale, suppose we generate \(h \times w\) sets of anchor boxes with different midpoints based on \(c_i\) feature maps with the shape \(h \times w\), and the number of anchor boxes in each set is \(a\). For example, for the first scale of the experiment, we generate 16 sets of anchor boxes with different midpoints based on 10 (number of channels) feature maps with a shape of \(4 \times 4\), and each set contains 3 anchor boxes. Next, each anchor box is labeled with a category and offset based on the category and position of the ground-truth bounding box. At the current scale, the object detection model needs to predict the category and offset of \(h \times w\) sets of anchor boxes with different midpoints based on the input image.
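The number of anchor boxes at each of the three scales from the experiment can be tallied as follows. This is a sketch: the helper `anchors_at_scale` is hypothetical, and it assumes the per-position count is the number of sizes plus the number of ratios minus 1, which gives the 3 boxes per set stated above for one size and three ratios.

```python
def anchors_at_scale(fmap_h, fmap_w, num_sizes, num_ratios):
    """Total anchor boxes for one feature map shape."""
    per_position = num_sizes + num_ratios - 1  # assumed combination rule
    return fmap_h * fmap_w * per_position

# Feature map shapes (height, width) used in the three display_anchors calls.
scales = [(4, 4), (2, 2), (1, 1)]
counts = [anchors_at_scale(fh, fw, num_sizes=1, num_ratios=3) for fh, fw in scales]
print(counts, sum(counts))  # [48, 12, 3] 63
```

Under these assumptions the three scales together use 63 anchor boxes, compared with over two million when every pixel is a center.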

We assume that the \(c_i\) feature maps are the intermediate output of the CNN based on the input image. Since each feature map has \(h \times w\) different spatial positions, the same position will have \(c_i\) units. According to the definition of receptive field in the “Two-Dimensional Convolutional Layer” section, the \(c_i\) units of the feature map at the same spatial position have the same receptive field on the input image. Thus, they represent the information of the input image in this same receptive field. Therefore, we can transform the \(c_i\) units of the feature map at the same spatial position into the categories and offsets of the \(a\) anchor boxes generated using that position as a midpoint. It is not hard to see that, in essence, we use the information of the input image in a certain receptive field to predict the category and offset of the anchor boxes close to the field on the input image.
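One possible way to organize such a transformation is to let each spatial position output one score per category for every anchor box, plus four offset values per anchor box. The sketch below only computes the resulting output shapes. The extra background category, the four offsets per box, and the helper `prediction_shapes` are assumptions borrowed from common single-shot detector designs, not something this section specifies.

```python
def prediction_shapes(batch, h, w, num_anchors, num_classes):
    """Output shapes (batch, channels, height, width) of two prediction layers."""
    # One score per category plus an assumed background category, per anchor box.
    cls_shape = (batch, num_anchors * (num_classes + 1), h, w)
    # Four offset values (assumed: two for the center, two for the size) per box.
    bbox_shape = (batch, num_anchors * 4, h, w)
    return cls_shape, bbox_shape

# First scale of the experiment: 4 x 4 feature map, 3 anchor boxes per position.
# num_classes=2 is a hypothetical value chosen for illustration.
cls_shape, bbox_shape = prediction_shapes(batch=1, h=4, w=4, num_anchors=3, num_classes=2)
print(cls_shape, bbox_shape)  # (1, 9, 4, 4) (1, 12, 4, 4)
```

Note that \(c_i\) does not appear in the output shapes: it only determines how many input units each position contributes to the prediction.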

When the feature maps of different layers have receptive fields of different sizes on the input image, they are used to detect objects of different sizes. For example, we can design a network to have a wider receptive field for each unit in the feature map that is closer to the output layer, to detect objects with larger sizes in the input image.
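The growth of the receptive field with depth can be sketched with the standard recurrence for stacked convolution and pooling layers: each layer enlarges the receptive field by (kernel size − 1) times the product of the strides of all earlier layers. The layer stack below is hypothetical and chosen only for illustration; padding and dilation are ignored.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field size after each layer."""
    rf, jump = 1, 1  # one input pixel; distance between adjacent units in input pixels
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # the kernel spans (kernel - 1) gaps of width `jump`
        jump *= stride             # striding spreads the units further apart
        sizes.append(rf)
    return sizes

# Hypothetical stack: 3x3 conv, 2x2 pool (stride 2), 3x3 conv, 2x2 pool (stride 2), 3x3 conv.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_fields(layers))  # [3, 4, 8, 10, 18]
```

Units in later layers see a wider part of the input image, which is why feature maps closer to the output layer suit larger objects.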

We will implement a multiscale object detection model in the following section.

Summary

  • We can generate anchor boxes of different numbers and sizes on multiple scales to detect objects of different sizes.
  • The shape of the feature map can be used to determine the midpoint of the anchor boxes that uniformly sample any image.
  • We use the information for the input image from a certain receptive field to predict the category and offset of the anchor boxes close to that field on the image.

Problems

  • Given an input image, assume the feature map has the shape \(1 \times c_i \times h \times w\), where \(c_i\), \(h\), and \(w\) are the number of channels, height, and width of the feature map. What methods can you think of to convert this variable into the anchor boxes' categories and offsets? What is the shape of the output?
