Gluon Implementation for Multi-GPU Computation

In Gluon, we can conveniently use data parallelism to perform multi-GPU computation. For example, we do not need to implement helper functions to synchronize data among multiple GPUs ourselves, as we did in the “Multi-GPU Computation” section.

First, import the required packages or modules for the experiment in this section. Running the programs here requires at least two GPUs.

In [1]:
import gluonbook as gb
import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn, utils as gutils
import time

Initialize Model Parameters on Multiple GPUs

In this section, we use ResNet-18 as the sample model. Since the input images in this section are kept at their original size (not enlarged), the model construction here differs from the ResNet-18 structure described in the “ResNet” section: the model uses a smaller convolution kernel, stride, and padding at the beginning, and the maximum pooling layer is removed.

In [2]:
def resnet18(num_classes):  # This function is saved in the gluonbook package for future use.
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(gb.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(gb.Residual(num_channels))
        return blk

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and removes the maximum pooling layer.
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net

net = resnet18(10)
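
As a quick sanity check (not part of the original notebook), we can run a throwaway copy of the network on the CPU to confirm the output shape. We use a separate instance so that net itself remains uninitialized for the multi-GPU initialization below.

tmp_net = resnet18(10)
tmp_net.initialize()  # default CPU initialization, just for this check
tmp_net(nd.random.uniform(shape=(1, 1, 28, 28))).shape  # expected: (1, 10)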

Previously, we discussed how to use the initialize function’s ctx parameter to initialize model parameters on a CPU or a single GPU. In fact, ctx can accept a list of CPUs and GPUs, in which case the initialized model parameters are copied to all of the devices in ctx.

In [3]:
ctx = [mx.gpu(0), mx.gpu(1)]
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)

Gluon provides the split_and_load function, which plays the role of the helper we implemented in the previous section: it divides a mini-batch of data instances and copies the slices to the individual CPUs or GPUs. The model computation on each slice then occurs on the device that holds it.

In [4]:
x = nd.random.uniform(shape=(4, 1, 28, 28))
gpu_x = gutils.split_and_load(x, ctx)
net(gpu_x[0]), net(gpu_x[1])
Out[4]:
(
 [[ 5.4814936e-06 -8.3371094e-07 -1.6316770e-06 -6.3674099e-07
   -3.8216162e-06 -2.3514044e-06 -2.5469599e-06 -9.4784696e-08
   -6.9033558e-07  2.5756231e-06]
  [ 5.4710872e-06 -9.4246496e-07 -1.0494070e-06  9.8081841e-08
   -3.3251815e-06 -2.4862918e-06 -3.3642798e-06  1.0455864e-07
   -6.1001344e-07  2.0327841e-06]]
 <NDArray 2x10 @gpu(0)>,
 [[ 5.6176345e-06 -1.2837586e-06 -1.4605541e-06  1.8302967e-07
   -3.5511653e-06 -2.4371013e-06 -3.5731798e-06 -3.0974860e-07
   -1.1016571e-06  1.8909889e-06]
  [ 5.1418697e-06 -1.3729932e-06 -1.1520088e-06  1.1507450e-07
   -3.7372811e-06 -2.8289724e-06 -3.6477197e-06  1.5781629e-07
   -6.0733043e-07  1.9712013e-06]]
 <NDArray 2x10 @gpu(1)>)
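
split_and_load slices along the batch axis (batch_axis=0 by default). A small sketch, not part of the original notebook: if the batch size does not divide evenly across the devices, passing even_split=False yields slices of unequal size instead of raising an error.

# Split 5 samples across 2 GPUs; with even_split=False the slices may
# have different sizes rather than triggering an error.
x5 = nd.random.uniform(shape=(5, 1, 28, 28))
slices = gutils.split_and_load(x5, ctx, batch_axis=0, even_split=False)
[(s.shape, s.context) for s in slices]  # e.g. 3 samples on gpu(0), 2 on gpu(1)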

Now we can access the initialized model parameter values through the data method. Note that weight.data() returns the parameter values on the CPU by default. Since we specified two GPUs when initializing the model parameters, we need to specify a GPU to access the parameter values. As we can see, the same parameter has identical values on the different GPUs.

In [5]:
weight = net[0].params.get('weight')

try:
    weight.data()
except RuntimeError:
    print('not initialized on', mx.cpu())
weight.data(ctx[0])[0], weight.data(ctx[1])[0]
not initialized on cpu(0)
Out[5]:
(
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(0)>,
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(1)>)
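
To confirm that not just this weight but every parameter is replicated identically, we can iterate over all model parameters and compare the copies on the two GPUs. A minimal check, not from the original notebook:

# Copy each replica to the CPU and compare element-wise.
for name, param in net.collect_params().items():
    a = param.data(ctx[0]).asnumpy()
    b = param.data(ctx[1]).asnumpy()
    assert (a == b).all(), name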

Multi-GPU Model Training

When we use multiple GPUs to train the model, the Trainer instance automatically performs data parallelism for us: after each mini-batch is divided and copied to the individual GPUs, it sums the gradients from all GPUs and broadcasts the result back to each of them. This makes it easy to implement the training function.

In [6]:
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = gb.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('running on:', ctx)
    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr})
    loss = gloss.SoftmaxCrossEntropyLoss()
    for epoch in range(4):
        start = time.time()
        for X, y in train_iter:
            gpu_Xs = gutils.split_and_load(X, ctx)
            gpu_ys = gutils.split_and_load(y, ctx)
            with autograd.record():
                ls = [loss(net(gpu_X), gpu_y)
                      for gpu_X, gpu_y in zip(gpu_Xs, gpu_ys)]
            for l in ls:
                l.backward()
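            # Trainer aggregates the gradients from all GPUs before the
            # update; passing the total batch size normalizes it accordingly.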
            trainer.step(batch_size)
        nd.waitall()
        train_time = time.time() - start
        test_acc = gb.evaluate_accuracy(test_iter, net, ctx[0])
        print('epoch %d, training time: %.1f sec, test_acc %.2f' % (
            epoch + 1, train_time, test_acc))

First, use a single GPU for training.

In [7]:
train(num_gpus=1, batch_size=256, lr=0.1)
running on: [gpu(0)]
epoch 1, training time: 63.9 sec, test_acc 0.89
epoch 2, training time: 61.2 sec, test_acc 0.80
epoch 3, training time: 61.2 sec, test_acc 0.92
epoch 4, training time: 61.2 sec, test_acc 0.92

Then we try using two GPUs for training. Compared with LeNet, used in the previous section, ResNet-18 is computationally more demanding, so communication takes up a smaller share of the total time relative to computation; parallel computation therefore improves performance more noticeably for ResNet-18.

In [8]:
train(num_gpus=2, batch_size=512, lr=0.2)
running on: [gpu(0), gpu(1)]
epoch 1, training time: 32.3 sec, test_acc 0.75
epoch 2, training time: 31.4 sec, test_acc 0.84
epoch 3, training time: 31.2 sec, test_acc 0.90
epoch 4, training time: 31.5 sec, test_acc 0.89
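
From these timings we can make a back-of-the-envelope speedup estimate (Fashion-MNIST has 60,000 training images per epoch; exact numbers depend on the hardware):

t1, t2 = 61.2, 31.4  # seconds per epoch on 1 GPU vs. 2 GPUs, from the runs above
print('throughput: %.0f vs. %.0f images/sec, speedup %.2fx'
      % (60000 / t1, 60000 / t2, t1 / t2))

That is roughly a 1.95x speedup, close to the ideal 2x.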

Summary

  • In Gluon, we can conveniently perform multi-GPU computations, such as initializing model parameters and training models on multiple GPUs.

Problems

  • This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation if conditions permit.
  • Sometimes, different devices provide different computing power: some machines use CPUs and GPUs at the same time, or GPUs of different models. How should we divide mini-batches among the different CPUs or GPUs in such cases?

Discuss on our Forum