TY - GEN
T1 - Layer-wise adaptive gradient sparsification for distributed deep learning with convergence guarantees
AU - Shi, Shaohuai
AU - Tang, Zhenheng
AU - Wang, Qiang
AU - Zhao, Kaiyong
AU - Chu, Xiaowen
N1 - Funding Information:
The research was supported by Hong Kong RGC GRF grant HKBU 12200418. We acknowledge the Nvidia AI Technology Centre (NVAITC) for providing GPU clusters for the experiments.
PY - 2020/8/24
Y1 - 2020/8/24
N2 - To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been proposed separately to alleviate the impact of communication overheads. Yet, gradient sparsification methods can only initiate communication after backpropagation has completed, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker independently selects a small set of 'significant' gradients from each layer, whose size can adapt to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity to overlap communications with computations, while its adaptive nature provides flexibility in controlling the communication time. We prove that LAGS-SGD has convergence guarantees and achieves the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD with no noticeable loss of model accuracy.
UR - http://www.scopus.com/inward/record.url?scp=85090287513&partnerID=8YFLogxK
U2 - 10.3233/FAIA200253
DO - 10.3233/FAIA200253
M3 - Conference proceeding
AN - SCOPUS:85090287513
T3 - Frontiers in Artificial Intelligence and Applications
SP - 1467
EP - 1474
BT - ECAI 2020 - 24th European Conference on Artificial Intelligence, including 10th Conference on Prestigious Applications of Artificial Intelligence, PAIS 2020 - Proceedings
A2 - De Giacomo, Giuseppe
A2 - Catalá, Alejandro
A2 - Dilkina, Bistra
A2 - Milano, Michela
A2 - Barro, Senén
A2 - Bugarín, Alberto
A2 - Lang, Jérôme
PB - IOS Press BV
T2 - 24th European Conference on Artificial Intelligence, ECAI 2020, including 10th Conference on Prestigious Applications of Artificial Intelligence, PAIS 2020
Y2 - 29 August 2020 through 8 September 2020
ER -