TY - GEN
T1 - Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs
AU - Shi, Shaohuai
AU - Wang, Qiang
AU - Chu, Xiaowen
AU - Li, Bo
AU - Qin, Yang
AU - Liu, Ruihao
AU - Zhao, Xinxiao
N1 - Funding Information: The research was supported in part by Hong Kong RGC GRF grants under the contracts HKBU 12200418, HKUST 16206417 and 16207818. We would also like to thank Nvidia AI Technology Centre (NVAITC) for providing the GPU clusters for some experiments.
PY - 2020/7
Y1 - 2020/7
N2 - Distributed synchronous stochastic gradient descent (SGD) algorithms are widely used in large-scale deep learning applications, while it is known that the communication bottleneck limits the scalability of the distributed system. Gradient sparsification is a promising technique to significantly reduce the communication traffic, while pipelining can further overlap the communications with computations. However, gradient sparsification introduces extra computation time, and pipelining requires many layer-wise communications which introduce significant communication startup overheads. Merging gradients from neighboring layers could reduce the startup overheads, but on the other hand it would increase the computation time of sparsification and the waiting time for the gradient computation. In this paper, we formulate the trade-off between communications and computations (including backward computation and gradient sparsification) as an optimization problem, and derive an optimal solution to the problem. We further develop the optimal merged gradient sparsification algorithm with SGD (OMGS-SGD) for distributed training of deep learning. We conduct extensive experiments to verify the convergence properties and scaling performance of OMGS-SGD. Experimental results show that OMGS-SGD achieves up to 31% end-to-end time efficiency improvement over the state-of-the-art sparsified SGD while preserving nearly consistent convergence performance with original SGD without sparsification on a 16-GPU cluster connected with 1 Gbps Ethernet.
KW - Distributed Deep Learning
KW - Gradient Communication
KW - Merged Gradient
UR - http://www.scopus.com/inward/record.url?scp=85090291269&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM41043.2020.9155269
DO - 10.1109/INFOCOM41043.2020.9155269
M3 - Conference proceeding
AN - SCOPUS:85090291269
T3 - Proceedings - IEEE INFOCOM
SP - 406
EP - 415
BT - INFOCOM 2020 - IEEE Conference on Computer Communications
PB - IEEE
T2 - 38th IEEE Conference on Computer Communications, IEEE INFOCOM 2020
Y2 - 6 July 2020 through 9 July 2020
ER -