Implementation of a Recurrent Neural Network from Scratch

In this section, we will implement a language model based on a character-level recurrent neural network from scratch and train it on the time machine data set (H. G. Wells' The Time Machine) to generate text that continues a given prefix. First, we read the time machine data set.

In [1]:
import sys
sys.path.insert(0, '..')

import gluonbook as gb
import math
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
import time

(corpus_indices, char_to_idx, idx_to_char,
 vocab_size) = gb.load_data_time_machine()

One-hot Vector

One-hot vectors provide a simple way to represent characters as vectors so that they can be fed into the neural network. Assume the number of distinct characters in the dictionary is \(N\) (the vocab_size) and that each character corresponds one-to-one with an integer index from 0 to \(N-1\). If the index of a character is the integer \(i\), then we create a vector of all 0s with a length of \(N\) and set the element at position \(i\) to 1. This vector is the one-hot vector of the original character. The one-hot vectors with indices 0 and 2 are shown below; the length of each vector equals the dictionary size.

In [2]:
nd.one_hot(nd.array([0, 2]), vocab_size)
Out[2]:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x43 @cpu(0)>

The shape of the mini-batch we sample each time is (batch size, number of time steps). The following function transforms such a mini-batch into a list of matrices, each with the shape (batch size, dictionary size), that can be fed into the network. The number of matrices equals the number of time steps. That is, the input of time step \(t\) is \(\boldsymbol{X}_t \in \mathbb{R}^{n \times d}\), where \(n\) is the batch size and \(d\) is the number of inputs, i.e. the one-hot vector length (the dictionary size).

In [3]:
def to_onehot(X, size):  # This function is saved in the gluonbook package for future use.
    return [nd.one_hot(x, size) for x in X.T]

X = nd.arange(10).reshape((2, 5))
inputs = to_onehot(X, vocab_size)
len(inputs), inputs[0].shape
Out[3]:
(5, (2, 43))

Initialize Model Parameters

Next, we initialize the model parameters. The number of hidden units num_hiddens is a hyper-parameter.

In [4]:
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
ctx = gb.try_gpu()
print('will use', ctx)

def get_params():
    def _one(shape):
        return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = nd.zeros(num_hiddens, ctx=ctx)
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = nd.zeros(num_outputs, ctx=ctx)
    # Attach a gradient
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
will use gpu(0)

Define the Model

We implement this model based on the computational expressions of the recurrent neural network. First, we define the init_rnn_state function to return the hidden state at initialization. It returns a tuple consisting of an NDArray with a value of 0 and a shape of (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple NDArrays.

In [5]:
def init_rnn_state(batch_size, num_hiddens, ctx):
    return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )

The following rnn function defines how to compute the hidden state and output at a time step. The activation function here is the tanh function. As described in the “Multilayer Perceptron” section, the mean of the tanh function's values is 0 when the inputs are uniformly distributed over the real numbers.
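
For reference, and matching the code below, the hidden state \(\boldsymbol{H}_t\) and output \(\boldsymbol{O}_t\) at time step \(t\) are computed as

\[\boldsymbol{H}_t = \tanh(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h), \qquad \boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q,\]

where the parameter names correspond to the variables created in get_params.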

In [6]:
def rnn(inputs, state, params):
    # Both inputs and outputs are composed of num_steps matrices of the shape (batch_size, vocab_size).
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
        Y = nd.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

Let's run a quick test to check the number of outputs (one per time step), the shape of the output at the first time step, and the shape of the returned hidden state.

In [7]:
state = init_rnn_state(X.shape[0], num_hiddens, ctx)
inputs = to_onehot(X.as_in_context(ctx), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
len(outputs), outputs[0].shape, state_new[0].shape
Out[7]:
(5, (2, 43), (2, 256))

Define the Prediction Function

The following function predicts the next num_chars characters based on the prefix (a string containing several characters). This function is a bit more involved: we pass the recurrent network rnn in as a function parameter, so that this function can be reused with the other recurrent neural networks described in the following sections.

In [8]:
# This function is saved in the gluonbook package for future use.
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # The output of the previous time step is taken as the input of the current time step.
        X = to_onehot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # Calculate the output and update the hidden state.
        (Y, state) = rnn(X, state, params)
        # The input to the next time step is the character in the prefix or the current best predicted character.
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])

We test the predict_rnn function first. We will generate a continuation of 10 characters (in addition to the prefix) based on the prefix “traveller”. Because the model parameters are random values, the prediction results are also random.

In [9]:
predict_rnn('traveller', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
            ctx, idx_to_char, char_to_idx)
Out[9]:
'travellergqan1pm!xs'

Clip Gradients

Gradient vanishing and explosion are more likely to occur in recurrent neural networks. We will explain the reason in subsequent sections of this chapter. To deal with gradient explosion, we can clip the gradient. Assume we concatenate the elements of all model parameter gradients into a vector \(\boldsymbol{g}\) and set the clipping threshold to \(\theta\). The clipped gradient

\[\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\boldsymbol{g}\]

has an \(L_2\) norm that does not exceed \(\theta\).
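
To see why, note that scaling \(\boldsymbol{g}\) by \(\min(\theta/\|\boldsymbol{g}\|, 1)\) gives a vector whose norm is

\[\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\|\boldsymbol{g}\| = \min\left(\theta, \|\boldsymbol{g}\|\right) \leq \theta.\]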

In [10]:
# This function is saved in the gluonbook package for future use.
def grad_clipping(params, theta, ctx):
    norm = nd.array([0.0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm

Perplexity

We generally use perplexity to evaluate the quality of a language model. Recall the definition of the cross entropy loss function in the “Softmax Regression” section. Perplexity is the exponential of the average cross entropy loss. In particular:

  • In the best case scenario, the model always predicts the probability of the label category as 1. In this situation, the perplexity is 1.
  • In the worst case scenario, the model always predicts the probability of the label category as 0. In this situation, the perplexity is positive infinity.
  • At the baseline, the model always predicts the same probability for all categories. In this situation, the perplexity is the number of categories.

Obviously, the perplexity of any valid model must be less than the number of categories. In this case, the perplexity must be less than the dictionary size vocab_size.
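
As a minimal numerical check of the best-case and baseline scenarios above (this snippet is only an illustration; it assumes the vocab_size loaded earlier and is not part of the model code):

import math

# Best case: the model always assigns probability 1 to the correct character,
# so the average cross entropy loss is 0 and the perplexity is exp(0) = 1.
print(math.exp(0.0))
# Baseline: the model assigns probability 1 / vocab_size to every character,
# so the loss is log(vocab_size) and the perplexity equals vocab_size
# (43 for this data set), up to floating-point rounding.
print(math.exp(math.log(vocab_size)))

The worst case is not shown: a predicted probability of 0 for the label gives an infinite loss and hence an infinite perplexity.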

Define Model Training Functions

Compared with the model training functions of the previous chapters, the model training functions here are different in the following ways:

  1. We use perplexity to evaluate the model.
  2. We clip the gradient before updating the model parameters.
  3. Different sampling methods for sequential data (random sampling versus adjacent sampling) lead to different ways of initializing the hidden state. For a discussion of these issues, please refer to the earlier section on language model data sets.

In addition, to accommodate the other recurrent neural networks described later, the function implementation here is more general, and therefore somewhat longer.

In [11]:
# This function is saved in the gluonbook package for future use.
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, ctx, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = gb.data_iter_random
    else:
        data_iter_fn = gb.data_iter_consecutive
    params = get_params()
    loss = gloss.SoftmaxCrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # If adjacent sampling is used, the hidden state is initialized at the beginning of the epoch.
            state = init_rnn_state(batch_size, num_hiddens, ctx)
        loss_sum, start = 0.0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, ctx)
        for t, (X, Y) in enumerate(data_iter):
            if is_random_iter:  # If random sampling is used, the hidden state is initialized before each mini-batch update.
                state = init_rnn_state(batch_size, num_hiddens, ctx)
            else:  # Otherwise, the detach function needs to be used to separate the hidden state from the computational graph.
                for s in state:
                    s.detach()
            with autograd.record():
                inputs = to_onehot(X, vocab_size)
                # outputs has num_steps matrices of the shape (batch_size, vocab_size).
                (outputs, state) = rnn(inputs, state, params)
                # The shape after stitching is (num_steps * batch_size, vocab_size).
                outputs = nd.concat(*outputs, dim=0)
                # The shape of Y is (batch_size, num_steps). After transposition and reshaping it becomes a
                # vector of length batch_size * num_steps, in one-to-one correspondence with the rows of outputs.
                y = Y.T.reshape((-1,))
                # The average classification error is calculated using cross entropy loss.
                l = loss(outputs, y).mean()
            l.backward()
            grad_clipping(params, clipping_theta, ctx)  # Clip the gradient.
            gb.sgd(params, lr, 1)  # Since the error has already taken the mean, the gradient does not need to be averaged.
            loss_sum += l.asscalar()

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(loss_sum / (t + 1)), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(
                    prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx))

Train the Model and Generate Text

Now we can train the model. First, set the model hyper-parameters. We will generate a segment of 50 characters (in addition to the prefix) based on the prefixes “traveller” and “time traveller”, and we generate such a segment from the currently trained model every 50 epochs.

In [12]:
num_epochs, num_steps, batch_size, lr, clipping_theta = 200, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['traveller', 'time traveller']

Next, we use random sampling to train the model and generate text.

In [13]:
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 7.389541, time 0.24 sec
 - traveller son the timensions of the timensions of the timen
 - time traveller and the precting the timensions of the timensions
epoch 100, perplexity 3.877981, time 0.24 sec
 - traveller.  'in the experthat space at all of the trimens o
 - time traveller.  'in the thing to is in the treve langly rear so
epoch 150, perplexity 2.094382, time 0.24 sec
 - traveller pwole that our consciousness on you are wrome. ar
 - time traveller pforeess,  nit we merace to you and mangen that w
epoch 200, perplexity 1.579459, time 0.24 sec
 - traveller.  'it waing at one dimensions of space ixca to a
 - time traveller.  'it would has at righors pastent mour more orav

Then, we use adjacent sampling to train the model and generate text.

In [14]:
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 7.066477, time 0.24 sec
 - traveller andithe bat our couls of the the thallere mant of
 - time traveller andithe bat our couls of the the thallere mant of
epoch 100, perplexity 3.418975, time 0.24 sec
 - traveller.  'you canter at is menttaly of the dithe far it
 - time traveller.  'you canter at is menttaly of the dithe far it
epoch 150, perplexity 1.970121, time 0.24 sec
 - traveller.  'suracter forthe this seist to the time travell
 - time traveller.  'y tal gerther shimental  'ou to the fime trave
epoch 200, perplexity 1.517633, time 0.24 sec
 - traveller.  'avk said filby, 'but you weand no the time tra
 - time traveller (for scesengot,' said the psychologist.  'now, ha

Summary

  • We can apply a language model based on a character-level recurrent neural network to generate sequences, such as continuing a text passage from a given prefix.
  • When training a recurrent neural network, we can clip the gradient to cope with gradient explosion.
  • Perplexity is the exponential of the average cross entropy loss.

Problems

  • Adjust the hyper-parameters and observe and analyze the impact on running time, perplexity, and the generated text.
  • Run the code in this section without clipping the gradient. What happens?
  • Set the pred_period variable to 1 to observe how the under-trained model (high perplexity) generates text. What can you learn from this?
  • Change adjacent sampling so that it does not separate hidden states from the computational graph. Does the running time change?
  • Replace the activation function used in this section with ReLU and repeat the experiments in this section.

Discuss on our Forum