Sony recently announced that it has achieved the fastest deep learning training speed in the industry using distributed learning. Using its own deep learning development framework and in-house library, Sony has achieved a feat that many in the deep learning world would envy. The team ran its experiments on a state-of-the-art cloud computing infrastructure called AI Bridging Cloud Infrastructure (ABCI), built by Japan’s National Institute of Advanced Industrial Science and Technology (AIST).
Sony is not usually counted among the most advanced companies in deep learning and AI research, but it still has a formidable research team and innovation practice. It has worked on machine learning for many years and has now figured out how to build distributed systems that use many GPUs to reduce the training time of large neural networks.
In a press release, Sony stated, “The results of this experiment demonstrate that learning/execution carried out using Neural Network Libraries can achieve world-class speeds.” The company also believes that using the same framework for deep learning development will shorten the trial-and-error cycle of experiments. Sony also said it is committed to the development of AI that will lead to a better society.
How They Did It
Distributed training of large neural networks looks like a good bet for getting great results quickly, but it is very hard in practice because large mini-batch training is unstable and gradient synchronization consumes heavy resources. The researchers at Sony controlled the batch size to make training more stable, and reduced the overhead of gradient synchronization with a technique called 2D-Torus all-reduce.
Both of these techniques are available in Sony’s Neural Network Libraries, and together they resulted in a record-beating training time without much loss in accuracy on the ImageNet dataset. Among the earlier results, the paper cites a training time of 1 hour using 256 Tesla P100 GPUs. The researchers also achieved a GPU scaling efficiency of 91.62% with 1088 Tesla V100 GPUs.
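To make the scaling-efficiency figure concrete, here is a minimal sketch of how such a number is typically computed, assuming (as the paper's authors describe) that a single node with 4 GPUs is the baseline. The function name and the throughput values in the usage lines are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(throughput, num_gpus, base_throughput, base_gpus=4):
    """Ratio of measured throughput to ideal linear scaling from a baseline.

    throughput      -- measured images/sec on num_gpus GPUs
    base_throughput -- measured images/sec on the baseline (base_gpus GPUs)
    """
    # Ideal throughput if adding GPUs scaled perfectly linearly.
    ideal = base_throughput * (num_gpus / base_gpus)
    return throughput / ideal


# Hypothetical numbers, purely for illustration.
efficiency = scaling_efficiency(throughput=900.0, num_gpus=40,
                                base_throughput=100.0)
print(f"scaling efficiency: {efficiency:.2%}")
```

An efficiency of 100% would mean that throughput grows exactly in proportion to the number of GPUs; communication overhead is what pulls the figure below that.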
2D-Torus: The Hidden Magic
Communication between GPUs is a bottleneck that has to be addressed when building robust large-scale distributed deep learning systems. Earlier topologies, including Ring all-reduce and hierarchical Ring all-reduce, aim to improve the efficiency of communication between GPUs. Building on this previous work, the researchers came up with the 2D-Torus topology, whose all-reduce method is made up of three essential steps: a reduce-scatter in the horizontal direction, an all-reduce in the vertical direction, and an all-gather in the horizontal direction.
This structure and procedure, in which the reduce and scatter operations are carefully sequenced, drastically lessens the communication overhead. In a hypothetical setting, let:
- N represent the number of GPUs in the cluster,
- X represent the number of GPUs in the horizontal direction,
- Y represent the number of GPUs in the vertical direction.
Then the 2D-Torus all-reduce executes 2(X − 1) GPU-to-GPU operations. By comparison, the Ring all-reduce scheme executes 2(N − 1) GPU-to-GPU operations. The researchers note that the hierarchical all-reduce method performs the same number of GPU-to-GPU operations as the 2D-Torus all-reduce method, but that the data size of the 2D-Torus all-reduce scheme is X times smaller than that of the hierarchical all-reduce scheme.
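The data flow of a 2D-Torus all-reduce can be sketched in plain Python. This is a single-process simulation under stated assumptions, not Sony's NCCL2-based implementation: the GPUs are modeled as a Y-by-X grid of Python lists, the three phases are taken to be a horizontal reduce-scatter, a vertical all-reduce, and a horizontal all-gather, and the function name `torus_all_reduce` is hypothetical.

```python
def torus_all_reduce(grads, X, Y):
    """Simulate a 2D-Torus all-reduce; grads[y][x] is the gradient on GPU (y, x)."""
    size = len(grads[0][0])
    assert size % X == 0, "gradient length must divide evenly into X chunks"
    chunk = size // X

    # Phase 1: horizontal reduce-scatter. Within each row, the GPU in
    # column x ends up with the row-wise sum of chunk x only.
    partial = [[None] * X for _ in range(Y)]
    for y in range(Y):
        for x in range(X):
            lo, hi = x * chunk, (x + 1) * chunk
            partial[y][x] = [sum(grads[y][k][i] for k in range(X))
                             for i in range(lo, hi)]

    # Phase 2: vertical all-reduce. Each column sums its chunk across rows;
    # every GPU handles only size / X values in this phase.
    reduced = [[None] * X for _ in range(Y)]
    for x in range(X):
        column_sum = [sum(partial[y][x][i] for y in range(Y))
                      for i in range(chunk)]
        for y in range(Y):
            reduced[y][x] = list(column_sum)

    # Phase 3: horizontal all-gather. Each row reassembles the full vector
    # from the X fully reduced chunks.
    result = [[None] * X for _ in range(Y)]
    for y in range(Y):
        full = [v for x in range(X) for v in reduced[y][x]]
        for x in range(X):
            result[y][x] = list(full)
    return result


# Four simulated GPUs in a 2 x 2 grid, each holding a 4-element gradient.
grads = [[[1, 2, 3, 4], [10, 20, 30, 40]],
         [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]]
out = torus_all_reduce(grads, X=2, Y=2)
print(out[0][0])
```

In the vertical phase each simulated GPU works on only `size // X` values, which is the sense in which the data exchanged in that phase is X times smaller than the full gradient.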
The experimental system used by the researchers is broken down into software, hardware, dataset and model, and training settings.
Software: They used Sony’s Neural Network Libraries (NNL) and its CUDA extension as the deep learning framework. The 2D-Torus all-reduce is implemented with NCCL2, and the software is packaged in a Singularity container.
Hardware: The researchers used AI Bridging Cloud Infrastructure (ABCI) as the GPU computing facility. The cluster includes 1088 nodes, where each node is made up of 4 NVIDIA Tesla V100 GPUs along with 2 Xeon Gold 6148 processors and 376 GB of memory.
Dataset and Model: The dataset used by the researchers is ImageNet, an image classification dataset with 1,000 classes. ImageNet contains close to 1.28 million training images along with 50,000 validation images. The researchers used Sony’s own implementation of image augmentation operations, including padding, scaling, rotations, resizing, distortion, flipping, brightness adjustment, contrast adjustment, and noising, in all of their experiments. They used ResNet-50 as the deep learning model.
Training Settings: The Sony researchers used LARS (Layer-wise Adaptive Rate Scaling) with a coefficient of 0.01 and eps of 1e-6 to update the weights.
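As a rough illustration of what a LARS step looks like, the sketch below applies one update to a single layer using the coefficient (0.01) and eps (1e-6) quoted above. It is not the Neural Network Libraries implementation: the function name is hypothetical, momentum is left out, and where eps and weight decay enter the formula is an assumption, since implementations differ on those details.

```python
import math


def lars_update(weights, grads, global_lr,
                coefficient=0.01, eps=1e-6, weight_decay=0.0):
    """One momentum-free LARS step for a single layer (illustrative only)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))

    # Layer-wise learning rate: scaled by the ratio of the weight norm to
    # the gradient norm, so each layer moves at a rate suited to its scale.
    # Assumption: eps is added to the denominator for numerical stability.
    local_lr = coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)

    return [w - global_lr * local_lr * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


# Hypothetical layer with two weights.
new_weights = lars_update([3.0, 4.0], [0.0, 5.0], global_lr=1.0)
print(new_weights)
```

Because the step size is tied to each layer's own weight and gradient norms, a layer with unusually large gradients does not get a disproportionately large update, which is the property that makes LARS attractive for very large mini-batches.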
Results and Conclusion
As a result of the experiments, the researchers were able to finish ResNet-50 training in 224 seconds with no significant loss of accuracy. They also reported other measurements, saying, “We describe training speed and GPU scaling efficiency compared to a single node (4 GPUs) of our method.” Talking about the improvements, they said, “Compared to the previous research, our communication scheme achieved higher GPU scaling efficiency with faster GPUs (Tesla V100) and more GPUs.”
The researchers concluded, “We employ several techniques to reduce accuracy degradation while maintaining high GPU scaling efficiency when training with an enormous GPU cluster. We achieved the training time of 224 seconds and the validation accuracy of 75.03% using 2176 Tesla V100 GPUs.”