(RP04) Distributed Deep Learning with FPGA Ring Allreduce
AI/Machine Learning/Deep Learning
Clouds and Distributed Computing
Time: Tuesday, June 18th, 8:30am - 10am
Description: Among the various methods for efficient distributed Deep Learning (DL), the top three state-of-the-art ImageNet/ResNet-50 training results were achieved with distributed data-parallel DL using Ring Allreduce or 2D-Torus Allreduce. However, these methods are difficult to apply at large scale because latency accumulates at each node as data is moved to the GPU or CPU for the Reduce operations. Our solution is to use In-Network Computing to perform data reduction while the data is being transferred through the network. Since conventional In-Network Computing systems apply only to hierarchical Allreduce, in this work we propose a new In-Network Computing system that supports Ring Allreduce. To minimize communication overhead, we apply layer-based computing/communication overlap and optimize it for the proposed In-Network Computing system. We also propose a highly productive software stack consisting of a DL framework and heterogeneous device control languages. The evaluation results show that we can reduce the communication overhead by 84.27% at a batch size of 32 without any accuracy degradation. Moreover, the total learning time can be reduced by 7% on a 4-node learning system. The results confirm that our system can significantly reduce the communication overhead without degrading accuracy when applied to large-scale distributed DL with a large communication load. Although the current top result uses 2D-Torus Allreduce on ASICs in a domain-specific architecture, our results show that the communication overhead is smaller with the proposed system, which indicates the potential of In-Network Computing.
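For readers unfamiliar with Ring Allreduce, the following is a minimal software sketch of the generic algorithm (a reduce-scatter phase followed by an allgather phase), simulated in a single Python process. It is not the FPGA or In-Network Computing implementation described in the abstract; the function name `ring_allreduce` and the list-of-lists data layout are assumptions made purely for illustration. It shows why the algorithm needs 2(N-1) neighbor-to-neighbor steps for N nodes, which is where the per-node latency mentioned above accumulates.

```python
from typing import List


def ring_allreduce(node_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum Ring Allreduce over N nodes (hypothetical helper).

    node_data[r] is the local vector (e.g. gradients) on node r.
    All vectors must have the same length, divisible by N.
    Returns the per-node buffers after the allreduce.
    """
    n = len(node_data)
    length = len(node_data[0])
    assert all(len(d) == length for d in node_data), "vectors must match in length"
    assert length % n == 0, "sketch assumes length divisible by node count"
    chunk = length // n
    bufs = [list(d) for d in node_data]  # copy so the inputs stay untouched

    def span(c: int) -> slice:
        # Index range of chunk c inside a node's buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In each of the N-1 steps, node r sends one
    # chunk to node r+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((c, list(bufs[rank][span(c)])))  # snapshot before updates
        for rank in range(n):
            c, payload = sends[(rank - 1) % n]  # receive from the left neighbor
            local = bufs[rank][span(c)]
            bufs[rank][span(c)] = [a + b for a, b in zip(local, payload)]
    # Now node r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: allgather. In each of the N-1 steps, node r forwards a fully
    # reduced chunk to node r+1, which overwrites its stale copy.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((c, list(bufs[rank][span(c)])))
        for rank in range(n):
            c, payload = sends[(rank - 1) % n]
            bufs[rank][span(c)] = payload

    return bufs


if __name__ == "__main__":
    # Four simulated nodes, each holding an 8-element vector.
    data = [[float(r * 10 + j) for j in range(8)] for r in range(4)]
    result = ring_allreduce(data)
    expected = [sum(col) for col in zip(*data)]
    print(all(buf == expected for buf in result))
```

In this sketch each step moves only 1/N of the vector per node, so bandwidth use is balanced, but the number of sequential steps grows with N. Performing the addition in the network, as the abstract proposes, targets the cost of each of those steps.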