Training bigger neural networks can be challenging when faced with accelerator memory limits. The datasets used by machine learning models are very large nowadays. For example, an image classification dataset such as hashtagged Instagram images contains millions of images. As image quality increases, so does the memory required. Today, an NVIDIA GPU offers only up to 32 GB of memory.
Therefore, memory has to be traded off between the model's parameters and the activations the network produces. It is understandable, then, why practitioners want to get past the memory limit of a single accelerator.
A deep neural network benefits from larger datasets because more data alleviates overfitting. And to run these ever-growing networks, we need deep learning supercomputers such as Google TPU or NVIDIA’s DGX, which enable parallelism by providing faster interconnections between the accelerators.
Today, the average ImageNet image resolution is 469 x 387, and it has been shown that increasing the size of the input image increases the final accuracy of a classifier. To fit within current accelerator memory limits, most models are made to process images of size 299 x 299 or 331 x 331.
GPipe can be used to partition a model across different accelerators and to automatically split a mini-batch of training examples into micro-batches. Pipelining the micro-batches through the partitions allows the accelerators to work in parallel.
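To make the pipelining idea concrete, here is a minimal sketch of a forward-only schedule. It assumes every partition takes exactly one clock tick per micro-batch, which is a simplification and not how GPipe itself schedules work; the function names are hypothetical.

```python
def pipeline_schedule(num_partitions, num_micro_batches):
    """Per clock tick, list the micro-batch each partition works on (None = idle)."""
    ticks = num_partitions + num_micro_batches - 1
    schedule = []
    for tick in range(ticks):
        row = []
        for partition in range(num_partitions):
            # Micro-batch m reaches partition k at tick m + k.
            micro_batch = tick - partition
            row.append(micro_batch if 0 <= micro_batch < num_micro_batches else None)
        schedule.append(row)
    return schedule


def idle_fraction(schedule):
    """Share of (tick, partition) slots in which a partition has nothing to do."""
    slots = [slot for row in schedule for slot in row]
    return slots.count(None) / len(slots)


if __name__ == "__main__":
    # 4 partitions, 8 micro-batches: print which micro-batch each partition handles.
    for tick, row in enumerate(pipeline_schedule(4, 8)):
        print(tick, row)
```

In this toy model, splitting the mini-batch into more micro-batches shrinks the share of idle slots, which is the intuition behind pipelining across accelerators.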
GPipe also reduces the memory needed during backpropagation: instead of keeping every forward activation, it automatically recomputes the forward activations during the backward pass. This lets users train larger models on more accelerators and scale performance without re-tuning hyperparameters.
Researchers at Google Brain say, “GPipe can support models up to 25 times larger using 8 accelerators without reducing the batch size. The implementation of GPipe is very efficient: with 4 times more accelerators we can achieve a 3.5 times speedup for training giant neural networks.”
To test and demonstrate GPipe’s functionality, the researchers trained an AmoebaNet model with 557 million parameters on the ImageNet ILSVRC 2012 dataset, using an input image size of 480 x 480. This scaled-up AmoebaNet model attains a top-1 validation accuracy of 84.3%, outperforming all other models trained from scratch on the ImageNet dataset.
The 2014 ImageNet challenge saw an accuracy of 74.8% with 4 million parameters. By 2017, the accuracy had risen to 82.7% using 145.8 million parameters, roughly 36 times the number of parameters used previously.
The researchers have also managed to push the CIFAR-10 accuracy to 99%. The CIFAR-10 dataset contains 60,000 32 x 32 color images in 10 different classes. The 10 different classes represent aeroplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.
Design Features Of GPipe
The core algorithm has been implemented using the TensorFlow library. When invoking the GPipe library, the user specifies a sequential list of L layers, where each layer specifies its model parameters, a stateless forward computation function, and an optional cost estimation function.
After the layer specifications have been defined, GPipe partitions the network into K composite layers and places the k-th composite layer onto the k-th accelerator. The number of partitions, K, is user-defined. During training, GPipe first divides a mini-batch of size N into T micro-batches at the first layer. Each micro-batch contains N/T examples.
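The two steps above can be sketched in a few lines. This is only an illustration: the greedy balancing rule below is an assumption chosen for simplicity, not GPipe's actual partitioning algorithm, and both function names are hypothetical.

```python
def split_into_micro_batches(mini_batch, num_micro_batches):
    """Divide a mini-batch of N examples into T micro-batches of N/T examples."""
    n = len(mini_batch)
    if n % num_micro_batches != 0:
        raise ValueError("this sketch assumes T divides N evenly")
    size = n // num_micro_batches
    return [mini_batch[i:i + size] for i in range(0, n, size)]


def partition_layers(layer_costs, num_partitions):
    """Split L layers into K contiguous groups with roughly equal estimated cost.

    Assumed heuristic: close group k once the cumulative cost reaches
    k/K of the total, while leaving at least one layer per remaining group.
    """
    num_layers = len(layer_costs)
    if not 1 <= num_partitions <= num_layers:
        raise ValueError("need 1 <= K <= L")
    total = sum(layer_costs)
    partitions, current, running = [], [], 0.0
    for index, cost in enumerate(layer_costs):
        current.append(index)
        running += cost
        groups_left = num_partitions - len(partitions) - 1
        layers_left = num_layers - index - 1
        reached_target = running >= total * (len(partitions) + 1) / num_partitions
        must_close = layers_left == groups_left
        if groups_left > 0 and (reached_target or must_close):
            partitions.append(current)
            current = []
    partitions.append(current)
    return partitions
```

Each inner list holds the indices of the layers that would form one composite layer, so the k-th list corresponds to the k-th accelerator.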
Each accelerator stores only the output activations at the partition boundaries, rather than the activations of all intermediate layers within the partition. During the backward pass, the accelerator recomputes the composite forward function from the cached boundary activations, which reduces the overall memory allocation.
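As a rough illustration of recomputation, the sketch below treats one partition as a chain of scalar layers. The layers, names, and scalar setting are all hypothetical; the point is only that the backward pass can rebuild the intermediate activations from the value cached at the partition boundary.

```python
import math

# Each hypothetical layer is a (forward, derivative) pair of scalar functions.
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
]


def forward_keep_boundary_only(layers, boundary_input):
    """Run the partition forward without storing intermediate activations."""
    x = boundary_input
    for forward, _ in layers:
        x = forward(x)
    return x  # only boundary_input and this output need to be kept


def backward_with_recompute(layers, boundary_input, grad_output):
    """Recompute the intermediate activations, then apply the chain rule."""
    layer_inputs = []
    x = boundary_input
    for forward, _ in layers:
        layer_inputs.append(x)  # rebuilt now instead of kept since the forward pass
        x = forward(x)
    grad = grad_output
    for (_, derivative), layer_input in zip(reversed(layers), reversed(layer_inputs)):
        grad *= derivative(layer_input)
    return grad  # gradient with respect to boundary_input
```

The trade is extra forward computation in exchange for holding the intermediate activations only while the backward pass for that partition runs.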
The gradients for each micro-batch are computed using the same model parameters as the forward pass. At the end of each mini-batch, the gradients accumulated over all micro-batches are applied to update the model parameters across accelerators. So GPipe's updates stay consistent with mini-batch gradient descent regardless of the number of partitions.
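A small sketch shows why this works, assuming a hypothetical one-weight linear model with a mean squared error loss. Because every micro-batch uses the same weight, summing the micro-batch gradients and dividing by N gives the same result as one pass over the full mini-batch.

```python
def grad_sum(weight, examples):
    """Sum of per-example gradients of the squared error (weight * x - y) ** 2."""
    return sum(2.0 * (weight * x - y) * x for x, y in examples)


def full_batch_gradient(weight, mini_batch):
    """Gradient of the mean loss computed in a single pass."""
    return grad_sum(weight, mini_batch) / len(mini_batch)


def accumulated_gradient(weight, mini_batch, num_micro_batches):
    """Same gradient, accumulated micro-batch by micro-batch (assumes T divides N)."""
    size = len(mini_batch) // num_micro_batches
    total = 0.0
    for start in range(0, len(mini_batch), size):
        micro_batch = mini_batch[start:start + size]
        total += grad_sum(weight, micro_batch)  # weight is not updated in between
    return total / len(mini_batch)


def sgd_step(weight, mini_batch, num_micro_batches, learning_rate):
    """Apply one update at the end of the mini-batch."""
    gradient = accumulated_gradient(weight, mini_batch, num_micro_batches)
    return weight - learning_rate * gradient
```

Plain gradient descent is used for the update here only to keep the example short; it is not the optimizer used in the experiments.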
To train the scaled-up model, the RMSProp optimizer with a decay of 0.9 and a label smoothing coefficient of 0.1 were used. The learning rate is scheduled to decay after 3 epochs at a rate of 0.97, with an initial learning rate of 0.00125 times the batch size. This giant scaled-up model reached 84.3% top-1 accuracy with single-crop evaluation.
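The numbers above can be turned into a short sketch. Two things here are assumptions rather than details given in the text: the decay is modelled as a staircase that multiplies the rate by 0.97 once per 3 completed epochs, and label smoothing is the standard mix of a one-hot target with a uniform distribution. The function names are hypothetical.

```python
def initial_learning_rate(batch_size, base=0.00125):
    """Initial learning rate: 0.00125 times the batch size."""
    return base * batch_size


def learning_rate_at(epoch, batch_size, decay_rate=0.97, epochs_per_decay=3):
    """Assumed staircase schedule: multiply by decay_rate once per 3 completed epochs."""
    return initial_learning_rate(batch_size) * decay_rate ** (epoch // epochs_per_decay)


def smooth_labels(true_class, num_classes, smoothing=0.1):
    """Standard label smoothing: (1 - smoothing) * one-hot + smoothing / num_classes."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[true_class] += 1.0 - smoothing
    return target
```

For example, with a hypothetical batch size of 256 the initial learning rate would be 0.32, and a 10-class smoothed target assigns 0.91 to the true class and 0.01 to each of the others.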
What Do Results Say
With GPipe, it is possible to:
- Support models up to 25 times larger using 8 accelerators, thanks to recomputation and model parallelism.
- Achieve up to a 3.5 times speedup with four times more accelerators by pipelining, as seen in the researchers' experiments.
- Train consistently regardless of the number of partitions due to synchronous gradient descent.
- Free researchers from the time-consuming process of re-tuning hyperparameters, and combine GPipe with data parallelism to scale neural network training across even more accelerators.
- Advance the performance of visual recognition tasks on multiple datasets, including pushing ImageNet top-5 accuracy to 97.0%, CIFAR-10 accuracy to 99.0%, and CIFAR-100 accuracy to 91.3%.
- Further improve training efficiency through better graph partitioning algorithms.