Abstract
To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank compression with Power-SGD) on a 32-GPU cluster. The results show that they do not always outperform well-optimized S-SGD and can even perform worse, because they are incompatible with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) in S-SGD. To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also benefits from the same three system optimizations as S-SGD. Compared with Power-SGD, the optimized ACP-SGD largely reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves average speedups of 4.06× and 1.43× over S-SGD and Power-SGD, respectively, and it consistently outperforms other baselines across different setups (from 8 GPUs to 64 GPUs and from 1 Gb/s Ethernet to 100 Gb/s InfiniBand).
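The idea of alternately compressing and communicating low-rank matrices can be illustrated with a small single-process sketch. This is not the authors' implementation: it assumes that "alternately" means updating and averaging only one factor (P on even steps, Q on odd steps) of a rank-r approximation P Qᵀ per iteration, it simulates the all-reduce with an in-memory mean, and it omits error feedback, pipelining, and tensor fusion. All function names (`acp_step`, `all_reduce_mean`, `demo`) are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthonormalize(A):
    """Gram-Schmidt over the columns of A (assumes they are linearly independent)."""
    basis = []
    for v in transpose(A):
        for u in basis:
            dot = sum(x * y for x, y in zip(u, v))
            v = [x - dot * y for x, y in zip(v, u)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return transpose(basis)

def all_reduce_mean(mats):
    """Stand-in for an all-reduce: element-wise mean across workers."""
    k = len(mats)
    return [[sum(vals) / k for vals in zip(*rows)] for rows in zip(*mats)]

def acp_step(step, grads, P, Q):
    """One alternating step (assumed schedule): only P or only Q is
    recomputed and communicated; the other factor is reused."""
    if step % 2 == 0:
        Q = orthonormalize(Q)
        P = all_reduce_mean([matmul(M, Q) for M in grads])             # n x r values sent
        sent = len(P) * len(P[0])
    else:
        P = orthonormalize(P)
        Q = all_reduce_mean([matmul(transpose(M), P) for M in grads])  # m x r values sent
        sent = len(Q) * len(Q[0])
    approx = matmul(P, transpose(Q))  # low-rank approximation P Q^T of the mean gradient
    return P, Q, approx, sent

def demo(n=6, m=4, r=2, seed=0):
    rng = random.Random(seed)
    rand = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    G = matmul(rand(n, r), rand(r, m))  # rank-r mean gradient (toy data)
    N = rand(n, m)                      # per-worker noise that cancels in the mean
    grads = [[[g + e for g, e in zip(gr, er)] for gr, er in zip(G, N)],
             [[g - e for g, e in zip(gr, er)] for gr, er in zip(G, N)]]
    P, Q = None, rand(m, r)             # P is first computed at step 0
    errors, volumes = [], []
    for step in range(4):
        P, Q, approx, sent = acp_step(step, grads, P, Q)
        errors.append(max(abs(a - g) for ar, gr in zip(approx, G) for a, g in zip(ar, gr)))
        volumes.append(sent)
    return errors, volumes

if __name__ == "__main__":
    errors, volumes = demo()
    print("max abs error per step:", errors)
    print("values communicated per step:", volumes, "vs dense", 6 * 4)
```

In this toy setting each step sends either n·r or m·r values instead of the n·m values of the dense gradient, and each step involves a single averaging operation on a dense factor, which is the property that would let it reuse all-reduce. Because the mean gradient here is exactly rank r, the approximation becomes exact (up to floating-point error) from the second step onward; real gradients are not exactly low-rank, so this behavior should not be read as a claim about training accuracy.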
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE 43rd International Conference on Distributed Computing Systems, ICDCS 2023 |
Publisher | IEEE |
Pages | 361-371 |
Number of pages | 11 |
ISBN (Electronic) | 9798350339864 |
ISBN (Print) | 9798350339871 |
Publication status | Published - 18 Jul 2023 |
Event | 43rd IEEE International Conference on Distributed Computing Systems, Hong Kong. Duration: 18 Jul 2023 → 21 Jul 2023. https://ieeexplore.ieee.org/xpl/conhome/10272385/proceeding (conference proceeding); https://icdcs2023.icdcs.org/ (conference website) |
Publication series
Name | Proceedings - International Conference on Distributed Computing Systems |
---|---|
Publisher | IEEE |
Volume | 2023-July |
ISSN (Print) | 1063-6927 |
ISSN (Electronic) | 2575-8411 |
Conference
Conference | 43rd IEEE International Conference on Distributed Computing Systems |
---|---|
Abbreviated title | ICDCS 2023 |
Country/Territory | Hong Kong |
Period | 18/07/23 → 21/07/23 |
User-Defined Keywords
- Distributed Deep Learning
- Gradient Compression
- Power-SGD
- System Optimization