Abstract
It has become common practice to train large machine learning (ML) models across a cluster of computing nodes connected by RDMA-enabled networks. However, the communication overhead caused by parameter synchronization degrades the performance of such distributed ML (DML), especially in large-scale settings. This paper tackles this issue by developing a traffic management scheme to support DML traffic, called TMDML (Traffic Management for DML), which needs only a minor modification to the existing RDMA congestion control scheme DCQCN. We assume that only one instance of a DML workload is running in the network. Existing literature has shown that Fat-Tree, a predominant data center topology, supports DML poorly compared with BCube. With our proposed TMDML, DML training in Fat-Tree can achieve better performance than in BCube. We first study the impact of multi-bottlenecks on DML via NS-3-based simulations. The results show that DCQCN is inefficient for DML traffic in the multi-bottlenecks scenario. To mitigate the impact of multi-bottlenecks, we formulate an optimization model that minimizes the maximum flow completion time (FCT) while stabilizing the queues, and then apply the Lyapunov optimization technique to solve the problem. For practical purposes, we present two heuristic implementations of TMDML for different deployment requirements. We evaluate the performance of our proposals by simulation, comparing them with DCQCN. We use All-Reduce parameter synchronization in Fat-Tree and BCube with traffic traces of modern deep neural network models, including AlexNet, ResNet50, and VGG-16. Our proposals achieve a time reduction of up to 59%.
Original language | English |
---|---|
Title of host publication | ICC 2021 - IEEE International Conference on Communications, Proceedings |
Publisher | IEEE |
Pages | 1-6 |
Number of pages | 6 |
ISBN (Electronic) | 9781728171227 |
ISBN (Print) | 9781728171234 |
Publication status | Published - Jun 2021 |
Event | 2021 IEEE International Conference on Communications, ICC 2021 - Virtual, Online, Montreal, Canada; Duration: 14 Jun 2021 → 23 Jun 2021; https://ieeexplore.ieee.org/xpl/conhome/9500243/proceeding |
Publication series
Name | Proceedings of IEEE International Conference on Communications |
---|---|
ISSN (Print) | 1550-3607 |
ISSN (Electronic) | 1938-1883 |
Conference
Conference | 2021 IEEE International Conference on Communications, ICC 2021 |
---|---|
Country/Territory | Canada |
City | Montreal |
Period | 14/06/21 → 23/06/21 |
Scopus Subject Areas
- Computer Networks and Communications
- Electrical and Electronic Engineering
User-Defined Keywords
- distributed machine learning (DML)
- multi-bottlenecks
- RDMA
- transport protocol