# Language Model Data Sets (Jay Chou Album Lyrics)¶

This section describes how to preprocess a language model data set and convert it to the input format required for a character-level recurrent neural network. To this end, we collected Jay Chou’s lyrics from his first album “Jay” to his tenth album “The Era”. In subsequent chapters, we will a recurrent neural network to train a language model on this data set. Once the model is trained, we can use it to write lyrics.

## Read the Data Sets¶

First, read this data set and see what the first 40 characters look like.

```
In [1]:
```

```
from mxnet import nd
import random
import zipfile
with zipfile.ZipFile('../data/jaychou_lyrics.txt.zip') as zin:
with zin.open('jaychou_lyrics.txt') as f:
corpus_chars = f.read().decode('utf-8')
corpus_chars[:40]
```

```
Out[1]:
```

```
'想要有直升机\n想要和你飞到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'
```

This data set has more than 50,000 characters. For ease of printing, we replaced line breaks with spaces and then used only the first 10,000 characters to train the model.

```
In [2]:
```

```
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
```

## Establish a Character Index¶

We map each character to continuous integers starting from 0, also known
as an index, to facilitate subsequent data processing. To get the index,
we extract all the different characters in the data set and then map
them to the index one by one to construct the dictionary. Then, print
`vocab_size`

, which is the number of different characters in the
dictionary, i.e. the dictionary size.

```
In [3]:
```

```
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size
```

```
Out[3]:
```

```
1027
```

After that, each character in the training data set is converted into an index, and the first 20 characters and their corresponding indexes are printed.

```
In [4]:
```

```
corpus_indices = [char_to_idx[char] for char in corpus_chars]
sample = corpus_indices[:20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
```

```
chars: 想要有直升机 想要和你飞到宇宙去 想要和
indices: [495, 181, 255, 581, 966, 284, 683, 495, 181, 930, 622, 831, 556, 448, 468, 494, 683, 495, 181, 930]
```

We packaged the above code in the `load_data_jay_lyrics`

function of
the `gluonbook`

package for to facilitate calling in later chapters.
After calling this function, we will get four variables in turn,
`corpus_indices`

, `char_to_idx`

, `idx_to_char`

, and
`vocab_size`

.

## Time Series Data Sampling¶

During training, we need to randomly read mini-batches of examples and labels each time. Unlike the experimental data from the previous chapter, a timing data instance usually contains consecutive characters. Assume that there are 5 time steps and the example sequence is 5 characters: “I”, “want”, “to”, “have”, “a”. The label sequence of the example is the character that follows these characters in the training set: “want”, “to”, “have”, “a”, “helicopter”. We have two ways to sample timing data, random sampling and adjacent sampling.

### Random sampling¶

The following code randomly samples a mini–batch from the data each
time. Here, the batch size `batch_size`

indicates to the number of
examples in each mini-batch and `num_steps`

is the number of time
steps included in each example. In random sampling, each example is a
sequence arbitrarily captured on the original sequence. The positions of
two adjacent random mini-batches on the original sequence are not
necessarily adjacent. Therefore, we cannot initialize the hidden state
of the next mini-batch with the hidden state of final time step of the
previous mini-batch. When training the model, the hidden state needs to
be reinitialized before each random sampling.

```
In [5]:
```

```
# This function is saved in the gluonbook package for future use.
def data_iter_random(corpus_indices, batch_size, num_steps, ctx=None):
# We subtract one because the index of the output is the index of the corresponding input plus one.
num_examples = (len(corpus_indices) - 1) // num_steps
epoch_size = num_examples // batch_size
example_indices = list(range(num_examples))
random.shuffle(example_indices)
# This returns a sequence of the length num_steps starting from pos.
def _data(pos):
return corpus_indices[pos: pos + num_steps]
for i in range(epoch_size):
# batch_size indicates the random examples read each time.
i = i * batch_size
batch_indices = example_indices[i: i + batch_size]
X = [_data(j * num_steps) for j in batch_indices]
Y = [_data(j * num_steps + 1) for j in batch_indices]
yield nd.array(X, ctx), nd.array(Y, ctx)
```

Let us input an artificial sequence from 0 to 29. We assume the batch
size and numbers of time steps are 2 and 6 respectively. Then we print
input `X`

and label `Y`

for each mini-batch of examples read by
random sampling. As we can see, the positions of two adjacent random
mini-batches on the original sequence are not necessarily adjacent.

```
In [6]:
```

```
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
print('X: ', X, '\nY:', Y, '\n')
```

```
X:
[[ 0. 1. 2. 3. 4. 5.]
[18. 19. 20. 21. 22. 23.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 1. 2. 3. 4. 5. 6.]
[19. 20. 21. 22. 23. 24.]]
<NDArray 2x6 @cpu(0)>
X:
[[ 6. 7. 8. 9. 10. 11.]
[12. 13. 14. 15. 16. 17.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 7. 8. 9. 10. 11. 12.]
[13. 14. 15. 16. 17. 18.]]
<NDArray 2x6 @cpu(0)>
```

### Adjacent sampling¶

In addition to random sampling of the original sequence, we can also make the positions of two adjacent random mini-batches adjacent on the original sequence. Now, we can use a hidden state of the last time step of a mini-batch to initialize the hidden state of the next mini-batch, so that the output of the next mini-batch is also dependent on the input of the mini-batch, with this pattern continuing in subsequent mini-batches. This has two effects on the implementation of recurrent neural network. On the one hand, when training the model, we only need to initialize the hidden state at the beginning of each epoch. On the other hand, when multiple adjacent mini-batches are concatenated by passing hidden states, the gradient calculation of the model parameters will depend on all the mini-batch sequences that are concatenated. In the same epoch as the number of iterations increases, the costs of gradient calculation rise. So that the model parameter gradient calculations only depend on the mini-batch sequence read by one iteration, we can separate the hidden state from the computational graph before reading the mini-batch. We will gain a deeper understand this approach in the following sections.

```
In [7]:
```

```
# This function is saved in the gluonbook package for future use.
def data_iter_consecutive(corpus_indices, batch_size, num_steps, ctx=None):
corpus_indices = nd.array(corpus_indices, ctx=ctx)
data_len = len(corpus_indices)
batch_len = data_len // batch_size
indices = corpus_indices[0: batch_size*batch_len].reshape((
batch_size, batch_len))
epoch_size = (batch_len - 1) // num_steps
for i in range(epoch_size):
i = i * num_steps
X = indices[:, i: i + num_steps]
Y = indices[:, i + 1: i + num_steps + 1]
yield X, Y
```

Using the same settings, print input `X`

and label `Y`

for each
mini-batch of examples read by random sampling. The positions of two
adjacent random mini-batches on the original sequence are adjacent.

```
In [8]:
```

```
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
print('X: ', X, '\nY:', Y, '\n')
```

```
X:
[[ 0. 1. 2. 3. 4. 5.]
[15. 16. 17. 18. 19. 20.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 1. 2. 3. 4. 5. 6.]
[16. 17. 18. 19. 20. 21.]]
<NDArray 2x6 @cpu(0)>
X:
[[ 6. 7. 8. 9. 10. 11.]
[21. 22. 23. 24. 25. 26.]]
<NDArray 2x6 @cpu(0)>
Y:
[[ 7. 8. 9. 10. 11. 12.]
[22. 23. 24. 25. 26. 27.]]
<NDArray 2x6 @cpu(0)>
```

## Summary¶

- Timing data sampling methods include random sampling and adjacent sampling. These two methods are implemented slightly differently in recurrent neural network model training.

## Problems¶

- What other mini-batch data sampling methods can you think of?
- If we want a sequence example to be a complete sentence, what kinds of problems does this introduce in mini-batch sampling?