Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

Lin Zhang, Longteng Zhang, Shaohuai Shi*, Xiaowen Chu*, Bo Li

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

5 Citations (Scopus)

Abstract

To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank with Power-SGD) on a 32-GPU cluster. The results show that they cannot always outperform well-optimized S-SGD, and can even perform worse, due to their incompatibility with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) used in S-SGD. To this end, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also enjoys the three system optimizations like S-SGD. Compared with Power-SGD, the optimized ACP-SGD can largely reduce the compression and communication overheads, while achieving similar model accuracy. In our experiments, ACP-SGD achieves an average of 4.06× and 1.43× speedups over S-SGD and Power-SGD, respectively, and it consistently outperforms other baselines across different setups (from 8 GPUs to 64 GPUs and from 1Gb/s Ethernet to 100Gb/s InfiniBand).
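To make the three compression families concrete, the sketch below illustrates each one on a NumPy array standing in for a gradient tensor. This is an illustrative sketch only, not the paper's implementation: the function names are hypothetical, error feedback and the distributed all-reduce step are omitted, and the Power-SGD variant shown is a single power-iteration step.

```python
import numpy as np

def sign_compress(grad):
    # Sign-SGD style quantization: transmit only the sign of each
    # entry (roughly 1 bit per element instead of 32).
    return np.sign(grad)

def topk_compress(grad, k):
    # Top-k sparsification: keep the k largest-magnitude entries
    # and zero out the rest (only values + indices are sent).
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def powersgd_compress(grad, rank, q=None):
    # Power-SGD style low-rank compression of a 2-D gradient M:
    # one power-iteration step yields P (m x r) and Q (n x r) with
    # M ~= P @ Q.T, so only P and Q need to be communicated.
    m, n = grad.shape
    if q is None:
        q = np.random.default_rng(0).standard_normal((n, rank))
    p = grad @ q
    p, _ = np.linalg.qr(p)   # orthogonalize P before the back-projection
    q = grad.T @ p
    return p, q

# Toy usage: a 64x32 "gradient" compressed to rank 4 sends
# 64*4 + 32*4 numbers instead of 64*32.
M = np.random.default_rng(1).standard_normal((64, 32))
p, q = powersgd_compress(M, rank=4)
print((p @ q.T).shape)
```

Note that `P` and `Q` are dense matrices amenable to all-reduce, which is why low-rank methods (and the ACP-SGD scheme proposed here, which alternates compressing and communicating these factors) compose more naturally with all-reduce, pipelining, and tensor fusion than sparsification does.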

Original language: English
Title of host publication: Proceedings - 2023 IEEE 43rd International Conference on Distributed Computing Systems, ICDCS 2023
Publisher: IEEE
Pages: 361-371
Number of pages: 11
ISBN (Electronic): 9798350339864
ISBN (Print): 9798350339871
DOIs
Publication status: Published - 18 Jul 2023
Event: 43rd IEEE International Conference on Distributed Computing Systems, Hong Kong
Duration: 18 Jul 2023 – 21 Jul 2023
https://ieeexplore.ieee.org/xpl/conhome/10272385/proceeding (conference proceeding)
https://icdcs2023.icdcs.org/ (conference website)

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems
Publisher: IEEE
Volume: 2023-July
ISSN (Print): 1063-6927
ISSN (Electronic): 2575-8411

Conference

Conference: 43rd IEEE International Conference on Distributed Computing Systems
Abbreviated title: ICDCS 2023
Country/Territory: Hong Kong
Period: 18/07/23 – 21/07/23

User-Defined Keywords

  • Distributed Deep Learning
  • Gradient Compression
  • Power-SGD
  • System Optimization
