Dive into Deep Learning
Table Of Contents
Dive into Deep Learning
Table Of Contents

Single Shot Multibox Detection (SSD)

In the previous few sections, we have introduced bounding boxes, anchor boxes, multiscale object detection, and data sets. Now, we will use this background knowledge to construct an object detection model: single shot multibox detection (SSD)[1]. This quick and easy model is already widely used. Some of the design concepts and implementation details of this model are also applicable to other object detection models.


Figure 9.4 shows the design of an SSD model. The model’s main components are a base network block and several multiscale feature blocks connected in a series. Here, the base network block is used to extract features of original images, and it generally takes the form of a deep convolutional neural network. The paper on SSDs chooses to place a truncated VGG before the classification layer[1], but this is now commonly replaced by ResNet. We can design the base network so that it outputs larger heights and widths. In this way, more anchor boxes are generated based on this feature map, allowing us to detect smaller objects. Next, each multiscale feature block reduces the height and width of the feature map provided by the previous layer (for example, it may reduce the sizes by half). The blocks then use each element in the feature map to expand the receptive field on the input image. In this way, the closer a multiscale feature block is to the top of Figure 9.4 the smaller its output feature map, and the fewer the anchor boxes that are generated based on the feature map. In addition, the closer a feature block is to the top, the larger the receptive field of each element in the feature map and the better suited it is to detect larger objects. As the SSD generates different numbers of anchor boxes of different sizes based on the base network block and each multiscale feature block and then predicts the categories and offsets (i.e., predicted bounding boxes) of the anchor boxes in order to detect objects of different sizes, SSD is a multiscale object detection model.



Next, we will describe the implementation of the modules in the figure. First, we need to discuss the implementation of category prediction and bounding box prediction.

Category Prediction Layer

Set the number of object categories to \(q\). In this case, the number of anchor box categories is \(q+1\), with 0 indicating an anchor box that only contains background. For a certain scale, set the height and width of the feature map to \(h\) and \(w\), respectively. If we use each element as the center to generate \(a\) anchor boxes, we need to classify a total of \(hwa\) anchor boxes. If we use a fully connected layer (FCN) for the output, this will likely result in an excessive number of model parameters. Recall how we used convolutional layer channels to output category predictions in the “Network in Network (NiN)” section. SSD uses the same method to reduce the model complexity.

Specifically, the category prediction layer uses a convolutional layer that maintains the input height and width. Thus, the output and input have a one-to-one correspondence to the spatial coordinates along the width and height of the feature map. Assuming that the output and input have the same spatial coordinates \((x,y)\), the channel for the coordinates \((x,y)\) on the output feature map contains the category predictions for all anchor boxes generated using the input feature map coordinates \((x,y)\) as the center. Therefore, there are \(a(q+1)\) output channels, with the output channels indexed as \(i(q+1) + j\) (\(0 \leq j \leq q\)) representing the predictions of the category index \(j\) for the anchor box index \(i\).

Now, we will define a category prediction layer of this type. After we specify the parameters \(a\) and \(q\), it uses a \(3\times3\) convolutional layer with a padding of 1. The heights and widths of the input and output of this convolutional layer remain unchanged.

In [1]:
%matplotlib inline
import gluonbook as gb
from mxnet import autograd, contrib, gluon, image, init, nd
from mxnet.gluon import loss as gloss, nn
import time

def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,

Bounding Box Prediction Layer

The design of the bounding box prediction layer is similar to that of the category prediction layer. The only difference is that, here, we need to predict 4 offsets for each anchor box, rather than \(q+1\) categories.

In [2]:
def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)

Concatenating Predictions for Multiple Scales

As we mentioned, SSD uses feature maps based on multiple scales to generate anchor boxes and predict their categories and offsets. Because the shapes and number of anchor boxes centered on the same element differ for the feature maps of different scales, the prediction outputs at different scales may have different shapes.

In the following example, we use the same batch of data to construct feature maps of two different scales, Y1 and Y2. Here, Y2 has half the height and half the width of Y1. Using category prediction as an example, we assume that each element in the Y1 and Y2 feature maps generates five (Y1) or three (Y2) anchor boxes. When there are 10 object categories, the number of category prediction output channels is either \(5\times(10+1)=55\) or \(3\times(10+1)=33\). The format of the prediction output is (batch size, number of channels, height, width). As you can see, except for the batch size, the sizes of the other dimensions are different. Therefore, we must transform them into a consistent format and concatenate the predictions of the multiple scales to facilitate subsequent computation.

In [3]:
def forward(x, block):
    return block(x)

Y1 = forward(nd.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
Y2 = forward(nd.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
(Y1.shape, Y2.shape)
((2, 55, 20, 20), (2, 33, 10, 10))

The channel dimension contains the predictions for all anchor boxes with the same center. We first move the channel dimension to the final dimension. Because the batch size is the same for all scales, we can convert the prediction results to binary format (batch size, height \(\times\) width \(\times\) number of channels) to facilitate subsequent concatenation on the 1st dimension.

In [4]:
def flatten_pred(pred):
    return pred.transpose((0, 2, 3, 1)).flatten()

def concat_preds(preds):
    return nd.concat(*[flatten_pred(p) for p in preds], dim=1)

Thus, regardless of the different shapes of Y1 and Y2, we can still concatenate the prediction results for the two different scales of the same batch.

In [5]:
concat_preds([Y1, Y2]).shape
(2, 25300)

Height and Width Downsample Block

For multiscale object detection, we define the following down_sample_blk block, which reduces the height and width by 50%. This block consists of two \(3\times3\) convolutional layers with a padding of 1 and a \(2\times2\) maximum pooling layer with a stride of 2 connected in a series. As we know, \(3\times3\) convolutional layers with a padding of 1 do not change the shape of feature maps. However, the subsequent pooling layer directly reduces the size of the feature map by half. Because \(1\times 2+(3-1)+(3-1)=6\), each element in the output feature map has a receptive field on the input feature map of the shape \(6\times6\). As you can see, the height and width downsample block enlarges the receptive field of each element in the output feature map.

In [6]:
def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
    return blk

By testing forward computation in the height and width downsample block, we can see that it changes the number of input channels and halves the height and width.

In [7]:
forward(nd.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape
(2, 10, 10, 10)

Base Network Block

The base network block is used to extract features from original images. To simplify the computation, we will construct a small base network. This network consists of three height and width downsample blocks connected in a series, so it doubles the number of channels at each step. When we input an original image with the shape \(256\times256\), the base network block outputs a feature map with the shape \(32 \times 32\).

In [8]:
def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
    return blk

forward(nd.zeros((2, 3, 256, 256)), base_net()).shape
(2, 64, 32, 32)

The Complete Model

The SSD model contains a total of five modules. Each module outputs a feature map used to generate anchor boxes and predict the categories and offsets of these anchor boxes. The first module is the base network block, modules two to four are height and width downsample blocks, and the fifth module is a global maximum pooling layer that reduces the height and width to 1. Therefore, modules two to five are all multiscale feature blocks shown in Figure 9.4.

In [9]:
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
        blk = down_sample_blk(128)
    return blk

Now, we will define the forward computation process for each module. In contrast to the previously-described convolutional neural networks, this module not only returns feature map Y output by convolutional computation, but also the anchor boxes of the current scale generated from Y and their predicted categories and offsets.

In [10]:
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = contrib.ndarray.MultiBoxPrior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

As we mentioned, the closer a multiscale feature block is to the top in Figure 9.4, the larger the objects it detects and the larger the anchor boxes it must generate. Here, we first divide the interval from 0.2 to 1.05 into five equal parts to determine the sizes of smaller anchor boxes at different scales: 0.2, 0.37, 0.54, etc. Then, according to \(\sqrt{0.2 \times 0.37} = 0.272\), \(\sqrt{0.37 \times 0.54} = 0.447\), and similar formulas, we determine the sizes of larger anchor boxes at the different scales.

In [11]:
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1

Now, we can define the complete model, TinySSD.

In [12]:
class TinySSD(nn.Block):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        for i in range(5):
            # The assignment statement is self.blk_i = get_blk(i).
            setattr(self, 'blk_%d' % i, get_blk(i))
            setattr(self, 'cls_%d' % i, cls_predictor(num_anchors,
            setattr(self, 'bbox_%d' % i, bbox_predictor(num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # getattr(self, 'blk_%d' % i) accesses self.blk_i.
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, 'blk_%d' % i), sizes[i], ratios[i],
                getattr(self, 'cls_%d' % i), getattr(self, 'bbox_%d' % i))
        # In the reshape function, 0 indicates that the batch size remains unchanged.
        return (nd.concat(*anchors, dim=1),
                    (0, -1, self.num_classes + 1)), concat_preds(bbox_preds))

We now create an SSD model instance and use it to perform forward computation on image mini-batch X, which has a height and width of 256 pixels. As we verified previously, the first module outputs a feature map with the shape \(32 \times 32\). Because modules two to four are height and width downsample blocks, module five is a global pooling layer, and each element in the feature map is used as the center for 4 anchor boxes, a total of \((32^2 + 16^2 + 8^2 + 4^2 + 1)\times 4 = 5444\) anchor boxes are generated for each image at the five scales.

In [13]:
net = TinySSD(num_classes=1)
X = nd.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
output anchors: (1, 5444, 4)
output class preds: (32, 5444, 2)
output bbox preds: (32, 21776)


Now, we will explain, step by step, how to train the SSD model for object detection.

Data Reading and Initialization

We read the Pikachu data set we created in the previous section.

In [14]:
batch_size = 32
train_data, test_data = gb.load_data_pikachu(batch_size)
# To ensure efficient GPU computation, we pad each training image with two bounding boxes labeled -1.
train_data.reshape(label_shape=(3, 5))
Downloading ../data/pikachu/train.rec from https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/pikachu/train.rec...
Downloading ../data/pikachu/train.idx from https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/pikachu/train.idx...
Downloading ../data/pikachu/val.rec from https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/pikachu/val.rec...

There is 1 category in the Pikachu data set. After defining the module, we need to initialize the model parameters and define the optimization algorithm.

In [15]:
ctx, net = gb.try_gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.2, 'wd': 5e-4})

Define Loss and Evaluation Functions

Object detection is subject to two types of losses. The first is anchor box category loss. For this, we can simply reuse the cross-entropy loss function we used in image classification. The second loss is positive anchor box offset loss. Offset prediction is a normalization problem. However, here, we do not use the squared loss introduced previously. Rather, we use the \(L_1\) norm loss, which is the absolute value of the difference between the predicted value and the ground-truth value. The mask variable bbox_masks removes negative anchor boxes and padding anchor boxes from the loss calculation. Finally, we add the anchor box category and offset losses to find the final loss function for the model.

In [16]:
cls_loss = gloss.SoftmaxCrossEntropyLoss()
bbox_loss = gloss.L1Loss()

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox

We can use the accuracy rate to evaluate the classification results. As we use the \(L_1\) norm loss, we will use the average absolute error to evaluate the bounding box prediction results.

In [17]:
def cls_eval(cls_preds, cls_labels):
    # Because the category prediction results are placed in the final dimension, argmax must specify this dimension.
    return (cls_preds.argmax(axis=-1) == cls_labels).mean().asscalar()

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return ((bbox_labels - bbox_preds) * bbox_masks).abs().mean().asscalar()

Train the Model

During model training, we must generate multiscale anchor boxes (anchors) in the model’s forward computation process and predict the category (cls_preds) and offset (bbox_preds) for each anchor box. Afterwards, we label the category (cls_labels) and offset (bbox_labels) of each generated anchor box based on the label information Y. Finally, we calculate the loss function using the predicted and labeled category and offset values. To simplify the code, we do not evaluate the training data set here.

In [18]:
for epoch in range(20):
    acc, mae = 0, 0
    train_data.reset()  # Read data from the start.
    start = time.time()
    for i, batch in enumerate(train_data):
        X = batch.data[0].as_in_context(ctx)
        Y = batch.label[0].as_in_context(ctx)
        with autograd.record():
            # Generate multiscale anchor boxes and predict the category and offset of each.
            anchors, cls_preds, bbox_preds = net(X)
            # Label the category and offset of each anchor box.
            bbox_labels, bbox_masks, cls_labels = contrib.nd.MultiBoxTarget(
                anchors, Y, cls_preds.transpose((0, 2, 1)))
            # Calculate the loss function using the predicted and labeled category and offset values.
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
        acc += cls_eval(cls_preds, cls_labels)
        mae += bbox_eval(bbox_preds, bbox_labels, bbox_masks)
    if (epoch + 1) % 5 == 0:
        print('epoch %2d, class err %.2e, bbox mae %.2e, time %.1f sec' % (
            epoch + 1, 1 - acc / (i + 1), mae / (i + 1), time.time() - start))
epoch  5, class err 2.91e-03, bbox mae 3.09e-03, time 11.6 sec
epoch 10, class err 2.52e-03, bbox mae 2.79e-03, time 11.8 sec
epoch 15, class err 2.37e-03, bbox mae 2.63e-03, time 11.7 sec
epoch 20, class err 2.29e-03, bbox mae 2.59e-03, time 11.6 sec


In the prediction stage, we want to detect all objects of interest in the image. Below, we read the test image and transform its size. Then, we convert it to the four-dimensional format required by the convolutional layer.

In [19]:
img = image.imread('../img/pikachu.jpg')
feature = image.imresize(img, 256, 256).astype('float32')
X = feature.transpose((2, 0, 1)).expand_dims(axis=0)

Using the MultiBoxDetection function, we predict the bounding boxes based on the anchor boxes and their predicted offsets. Then, we use non-maximum suppression to remove similar bounding boxes.

In [20]:
def predict(X):
    anchors, cls_preds, bbox_preds = net(X.as_in_context(ctx))
    cls_probs = cls_preds.softmax().transpose((0, 2, 1))
    output = contrib.nd.MultiBoxDetection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0].asscalar() != -1]
    return output[0, idx]

output = predict(X)

Finally, we take all the bounding boxes with a confidence level of at least 0.3 and display them as the final output.

In [21]:
gb.set_figsize((5, 5))

def display(img, output, threshold):
    fig = gb.plt.imshow(img.asnumpy())
    for row in output:
        score = row[1].asscalar()
        if score < threshold:
        h, w = img.shape[0:2]
        bbox = [row[2:6] * nd.array((w, h, w, h), ctx=row.context)]
        gb.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output, threshold=0.3)


  • SSD is a multiscale object detection model. This model generates different numbers of anchor boxes of different sizes based on the base network block and each multiscale feature block and predicts the categories and offsets of the anchor boxes to detect objects of different sizes.
  • During SSD model training, the loss function is calculated using the predicted and labeled category and offset values.


  • Due to space limitations, we have ignored some of the implementation details of SSD models in this experiment. Can you further improve the model in the following areas?

Loss Function

For the predicted offsets, replace \(L_1\) norm loss with \(L_1\) regularization loss. This loss function uses a square function around zero for greater smoothness. This is the regularized area controlled by the hyper-parameter \(\sigma\):

\[\begin{split}f(x) = \begin{cases} (\sigma x)^2/2,& \text{if }|x| < 1/\sigma^2\\ |x|-0.5/\sigma^2,& \text{otherwise} \end{cases}\end{split}\]

When \(\sigma\) is large, this loss is similar to the \(L_1\) norm loss. When the value is small, the loss function is smoother.

In [22]:
sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = nd.arange(-2, 2, 0.1)

for l, s in zip(lines, sigmas):
    y = nd.smooth_l1(x, scalar=s)
    gb.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)

In the experiment, we used cross-entropy loss for category prediction. Now, assume that the prediction probability of the actual category \(j\) is \(p_j\) and the cross-entropy loss is \(-\log p_j\). We can also use the focal loss[2]. Given the positive hyper-parameters \(\gamma\) and \(\alpha\), this loss is defined as:

\[- \alpha (1-p_j)^{\gamma} \log p_j.\]

As you can see, by increasing \(\gamma\), we can effectively reduce the loss when the probability of predicting the correct category is high.

In [23]:
def focal_loss(gamma, x):
    return -(1 - x) ** gamma * x.log()

x = nd.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = gb.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
                    label='gamma=%.1f' % gamma)

Training and Prediction

  • When an object is relatively large compared to the image, the model normally adopts a larger input image size.
  • This generally produces a large number of negative anchor boxes when labeling anchor box categories. We can sample the negative anchor boxes to better balance the data categories. To do this, we can set the MultiBoxTarget function’s negative_mining_ratio parameter.
  • Assign hyper-parameters with different weights to the anchor box category loss and positive anchor box offset loss in the loss function.
  • Refer to the SSD paper. What methods can be used to evaluate the precision of object detection models[1]?


[1] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.

[2] Lin, T. Y., Goyal, P., Girshick, R., He, K., & Doll á r, P. (2018). Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence.

Discuss on our Forum