Dive into Deep Learning
Table Of Contents
Dive into Deep Learning
Table Of Contents


In the previous chapter, we introduced one classical approach to regularize statistical models. We penalized the size (the \(\ell_2\) norm) of the weights, coercing them to take smaller values. In probabilistic terms we might say that this imposes a Gaussian prior on the value of the weights. But in more intuitive, functional terms, we can say that this encourages the model to spread out its weights among many features and not to depend too much on a small number of potentially spurious associations.

Overfitting Revisited

With great flexibility comes overfitting liability. Given many more features than examples, linear models can overfit. But when there are many more examples than features, linear models can usually be counted on not to overfit. Unfortunately this propensity to generalize well comes at a cost. For every feature, a linear model has to assign it either positive or negative weight. Linear models can ’ t take into account nuanced interactions between features. In more formal texts, you ’ ll see this phenomena discussed as the bias-variance tradeoff. Linear models have high bias, (they can only represent a small class of functions), but low variance (they give similar results across different random samples of the data).

Deep neural networks, however, occupy the opposite end of the bias-variance spectrum. Neural networks are so flexible because they aren ’ t confined to looking at each feature individually. Instead, they can learn complex interactions among groups of features. For example, they might infer that “Nigeria” and “Western Union” appearing together in an email indicates spam but that “Nigeria” without “Western Union” does not connote spam.

Even for a small number of features, deep neural networks are capable of overfitting. As one demonstration of the incredible flexibility of neural networks, researchers showed that neural networks perfectly classify randomly labeled data. Let ’ s think about what means. If the labels are assigned uniformly at random, and there are 10 classes, then no classifier can get better than 10% accuracy on holdout data. Yet even in these situations, when there is no true pattern to be learned, neural networks can perfectly fit the training labels.

Robustness through Perturbations

Let’s think briefly about what we expect from a good statistical model. Obviously we want it to do well on unseen test data. One way we can accomplish this is by asking for what amounts to a ‘simple’ model. Simplicity can come in the form of a small number of dimensions, which is what we did when discussing fitting a function with monomial basis functions. Simplicity can also come in the form of a small norm for the basis funtions. This is what led to weight decay and \(\ell_2\) regularization. Yet a third way to impose some notion of simplicity is that the function should be robust under modest changes in the input. For instance, when we classify images, we would expect that alterations of a few pixels are mostly harmless.

In fact, this notion was formalized by Bishop in 1995, when he proved that Training with Input Noise is Equivalent to Tikhonov Regularization. That is, he connected the notion of having a smooth (and thus simple) function with one that is resilient to perturbations in the input. Fast forward to 2014. Given the complexity of deep networks with many layers, enforcing smoothness just on the input misses out on what is happening in subsequent layers. The ingenious idea of Srivastava et al., 2014 was to apply Bishop’s idea to the internal layers of the network, too, namely to inject noise into the computational path of the network while it’s training.

A key challenge in this context is how to add noise without introducing undue bias. In terms of inputs \(\mathbf{x}\), this is relatively easy to accomplish: simply add some noise \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) to it and use this data during training via \(\mathbf{x}' = \mathbf{x} + \epsilon\). A key property is that in expectation \(\mathbf{E}[\mathbf{x}'] = \mathbf{x}\). For intermediate layers, though, this might not be quite so desirable since the scale of the noise might not be appropriate. The alternative is to perturb coordinates as follows:

\[\begin{split}\begin{aligned} h' = \begin{cases} 0 & \text{ with probability } p \\ \frac{h}{1-p} & \text{ otherwise} \end{cases} \end{aligned}\end{split}\]

By design, the expectation remains unchanged, i.e. \(\mathbf{E}[h'] = h\). This idea is at the heart of dropout where intermediate activations \(h\) are replaced by a random variable \(h'\) with matching expectation. The name ‘dropout’ arises from the notion that some neurons ‘drop out’ of the computation for the purpose of computing the final result. During training we replace intermediate activations with random variables

Dropout in Practice

Recall the multilayer perceptron with a hidden layer and 5 hidden units. Its architecture is given by

\[\begin{split}\begin{aligned} h & = \sigma(W_1 x + b_1) \\ o & = W_2 h + b_2 \\ \hat{y} & = \mathrm{softmax}(o) \end{aligned}\end{split}\]

When we apply dropout to the hidden layer, it amounts to removing hidden units with probability \(p\) since their output is set to \(0\) with that probability. A possible result is the network shown below. Here \(h_2\) and \(h_5\) are removed. Consequently the calculation of \(y\) no longer depends on \(h_2\) and \(h_5\) and their respective gradient also vanishes when performing backprop. In this way, the calculation of the output layer cannot be overly dependent on any one element of \(h_1, \ldots, h_5\). This is exactly what we want for regularization purposes to cope with overfitting. At test time we typically do not use dropout to obtain more conclusive results.



Dropout from Scratch

To implement the dropout function we have to draw as many random variables as the input has dimensions from the uniform distribution \(U[0,1]\). According to the definition of dropout, we can implement it easily. The following dropout function will drop out the elements in the NDArray input X with the probability of drop_prob.

In [1]:
import gluonbook as gb
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn

def dropout(X, drop_prob):
    assert 0 <= drop_prob <= 1
    # In this case, all elements are dropped out.
    if drop_prob == 1:
        return X.zeros_like()
    mask = nd.random.uniform(0, 1, X.shape) > drop_prob
    return mask * X / (1.0-drop_prob)

Let us test how it works in a few examples. The dropout probability is 0, 0.5, and 1, respectively.

In [2]:
X = nd.arange(16).reshape((2, 8))
print(dropout(X, 0))
print(dropout(X, 0.5))
print(dropout(X, 1))

[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
<NDArray 2x8 @cpu(0)>

[[ 0.  0.  0.  0.  8. 10. 12.  0.]
 [16.  0. 20. 22.  0.  0.  0. 30.]]
<NDArray 2x8 @cpu(0)>

[[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x8 @cpu(0)>

Defining Model Parameters

Let’s use the same dataset as used previously, namely Fashion-MNIST, described in the section “Softmax Regression - Starting From Scratch”. We will define a multilayer perceptron with two hidden layers. The two hidden layers both have 256 outputs.

In [3]:
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256

W1 = nd.random.normal(scale=0.01, shape=(num_inputs, num_hiddens1))
b1 = nd.zeros(num_hiddens1)
W2 = nd.random.normal(scale=0.01, shape=(num_hiddens1, num_hiddens2))
b2 = nd.zeros(num_hiddens2)
W3 = nd.random.normal(scale=0.01, shape=(num_hiddens2, num_outputs))
b3 = nd.zeros(num_outputs)

params = [W1, b1, W2, b2, W3, b3]
for param in params:

Define the Model

The model defined below concatenates the fully connected layer and the activation function ReLU, using dropout for the output of each activation function. We can set the dropout probability of each layer separately. It is generally recommended to set a lower dropout probability closer to the input layer. Below we set it to 0.2 and 0.5 for the first and second hidden layer respectively. By using the is_training function described in the “Autograd” section we can ensure that dropout is only active during training.

In [4]:
drop_prob1, drop_prob2 = 0.2, 0.5

def net(X):
    X = X.reshape((-1, num_inputs))
    H1 = (nd.dot(X, W1) + b1).relu()
    if autograd.is_training():        # Use dropout only when training the model.
        H1 = dropout(H1, drop_prob1)  # Add a dropout layer after the first fully connected layer.
    H2 = (nd.dot(H1, W2) + b2).relu()
    if autograd.is_training():
        H2 = dropout(H2, drop_prob2)  # Add a dropout layer after the second fully connected layer.
    return nd.dot(H2, W3) + b3

Training and Testing

This is similar to the training and testing of multilayer perceptrons described previously.

In [5]:
num_epochs, lr, batch_size = 10, 0.5, 256
loss = gloss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
gb.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params,
epoch 1, loss 1.1351, train acc 0.558, test acc 0.789
epoch 2, loss 0.5803, train acc 0.785, test acc 0.830
epoch 3, loss 0.4939, train acc 0.821, test acc 0.850
epoch 4, loss 0.4439, train acc 0.838, test acc 0.859
epoch 5, loss 0.4228, train acc 0.846, test acc 0.862
epoch 6, loss 0.3985, train acc 0.855, test acc 0.867
epoch 7, loss 0.3863, train acc 0.859, test acc 0.871
epoch 8, loss 0.3706, train acc 0.865, test acc 0.871
epoch 9, loss 0.3583, train acc 0.869, test acc 0.871
epoch 10, loss 0.3583, train acc 0.869, test acc 0.875

Dropout in Gluon

In Gluon, we only need to add the Dropout layer after the fully connected layer and specify the dropout probability. When training the model, the Dropout layer will randomly drop out the output elements of the previous layer at the specified dropout probability; the Dropout layer simply passes the data through during testin.

In [6]:
net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob1),  # Add a dropout layer after the first fully connected layer.
        nn.Dense(256, activation="relu"),
        nn.Dropout(drop_prob2),  # Add a dropout layer after the second fully connected layer.

Next, we will train and test the model.

In [7]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
gb.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
             None, None, trainer)
epoch 1, loss 1.2070, train acc 0.534, test acc 0.769
epoch 2, loss 0.5924, train acc 0.781, test acc 0.829
epoch 3, loss 0.4975, train acc 0.819, test acc 0.851
epoch 4, loss 0.4515, train acc 0.836, test acc 0.860
epoch 5, loss 0.4223, train acc 0.847, test acc 0.862
epoch 6, loss 0.4009, train acc 0.855, test acc 0.868
epoch 7, loss 0.3922, train acc 0.858, test acc 0.855
epoch 8, loss 0.3703, train acc 0.866, test acc 0.870
epoch 9, loss 0.3593, train acc 0.868, test acc 0.873
epoch 10, loss 0.3489, train acc 0.872, test acc 0.880


  • Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool to avoid overfitting. Often all three are used jointly.
  • Dropout replaces an activation \(h\) with a random variable \(h'\) with expected value \(h\) and with variance given by the dropout probability \(p\).
  • Dropout is only used during training.


  1. Try out what happens if you change the dropout probabilities for layers 1 and 2. In particular, what happens if you switch the ones for both layers?
  2. Increase the number of epochs and compare the results obtained when using dropout with those when not using it.
  3. Compute the variance of the the activation random variables after applying dropout.
  4. Why should you typically not using dropout?
  5. If changes are made to the model to make it more complex, such as adding hidden layer units, will the effect of using dropout to cope with overfitting be more obvious?
  6. Using the model in this section as an example, compare the effects of using dropout and weight decay. What if dropout and weight decay are used at the same time?
  7. What happens if we apply dropout to the individual weights of the weight matrix rather than the activations?
  8. Replace the dropout activation with a random variable that takes on values of \([0, \gamma/2, \gamma]\). Can you design something that works better than the binary dropout function? Why might you want to use it? Why not?


[1] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). JMLR

Discuss on our Forum