TY - GEN
T1 - A convergence analysis of distributed SGD with communication-efficient gradient sparsification
AU - Shi, Shaohuai
AU - Zhao, Kaiyong
AU - Wang, Qiang
AU - Tang, Zhenheng
AU - Chu, Xiaowen
N1 - Funding Information:
The research was supported by Hong Kong RGC GRF grant HKBU 12200418.
PY - 2019/8
Y1 - 2019/8
N2 - Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts system scalability. However, it remains unclear whether the gTop-k sparsification scheme can converge in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as that of vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on the convergence performance.
AB - Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts system scalability. However, it remains unclear whether the gTop-k sparsification scheme can converge in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as that of vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on the convergence performance.
UR - https://www.ijcai.org/proceedings/2019/
UR - http://www.scopus.com/inward/record.url?scp=85074915733&partnerID=8YFLogxK
U2 - 10.24963/ijcai.2019/473
DO - 10.24963/ijcai.2019/473
M3 - Conference contribution
AN - SCOPUS:85074915733
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 3411
EP - 3417
BT - Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
A2 - Kraus, Sarit
PB - International Joint Conferences on Artificial Intelligence
T2 - 28th International Joint Conference on Artificial Intelligence, IJCAI 2019
Y2 - 10 August 2019 through 16 August 2019
ER -
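
N1 - Editor's illustrative note (not part of the cited paper): the abstract contrasts per-worker Top-k sparsification, whose communication cost is O(kP), with the paper's gTop-k scheme at O(k log P). The sketch below only illustrates the baseline Top-k selection and a naive O(kP) aggregation; the function names, shapes, and random data are hypothetical, and the gTop-k merging procedure itself is not reproduced here.

    # Minimal sketch of per-worker Top-k gradient sparsification (assumed
    # illustration, not the authors' implementation). Each worker keeps only
    # its k largest-magnitude gradient entries; gathering all P workers'
    # selections densely costs O(kP) values per iteration.
    import numpy as np

    def topk_sparsify(grad, k):
        """Return (indices, values) of the k largest-magnitude entries of grad."""
        flat = grad.ravel()
        idx = np.argpartition(np.abs(flat), -k)[-k:]  # unordered top-k indices
        return idx, flat[idx]

    def aggregate_topk(worker_grads, k):
        """Naive dense aggregation of per-worker Top-k selections (O(kP) comm)."""
        agg = np.zeros_like(worker_grads[0].ravel())
        for g in worker_grads:
            idx, vals = topk_sparsify(g, k)
            agg[idx] += vals
        return (agg / len(worker_grads)).reshape(worker_grads[0].shape)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        grads = [rng.standard_normal((4, 4)) for _ in range(4)]  # P = 4 workers
        print(aggregate_topk(grads, k=3))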