Traffic Management for Distributed Machine Learning in RDMA-enabled Data Center Networks

Weihong Yang, Yang Qin*, Zukai Jiang, Xiaowen Chu

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference contribution › peer-review

Abstract

It has become common practice to train large machine learning (ML) models across a cluster of computing nodes connected by RDMA-enabled networks. However, the communication overhead caused by parameter synchronization deteriorates the performance of such distributed ML (DML), especially in a large-scale setting. This paper tackles this issue by developing a traffic management scheme to support DML traffic, called TMDML (Traffic Management for DML), which needs only a minor modification to the existing RDMA congestion control scheme DCQCN. We assume that there is only one instance of a DML workload running in a network. Existing literature has shown that Fat-Tree, a predominant topology in the data center, supports DML poorly compared with BCube. With our proposed TMDML, training DML in Fat-Tree can achieve better performance than in BCube. We first study the impact of multi-bottlenecks on DML via NS-3-based simulations. The results show that DCQCN is inefficient for DML traffic in the multi-bottlenecks scenario. To mitigate the impact of multi-bottlenecks, we propose an optimization model to minimize the maximum flow completion time (FCT) while stabilizing the queues, and then apply the Lyapunov optimization technique to solve the problem. For practical purposes, we present two heuristic implementations of TMDML for different deployment requirements. We evaluate the performance of our proposals by simulation, comparing them with DCQCN. We use All-Reduce parameter synchronization in Fat-Tree and BCube with traffic traces of modern deep neural network models, including AlexNet, ResNet50, and VGG-16. Our proposals can achieve up to a 59% time reduction.

Original language: English
Title of host publication: ICC 2021 - IEEE International Conference on Communications, Proceedings
Publisher: IEEE
Pages: 1-6
Number of pages: 6
ISBN (Electronic): 9781728171227
ISBN (Print): 9781728171234
DOIs
Publication status: Published - Jun 2021
Event: 2021 IEEE International Conference on Communications, ICC 2021 - Virtual, Online, Canada
Duration: 14 Jun 2021 → 23 Jun 2021

Publication series

Name: Proceedings of IEEE International Conference on Communications
ISSN (Print): 1550-3607
ISSN (Electronic): 1938-1883

Conference

Conference: 2021 IEEE International Conference on Communications, ICC 2021
Country/Territory: Canada
City: Virtual, Online
Period: 14/06/21 → 23/06/21

Scopus Subject Areas

  • Computer Networks and Communications
  • Electrical and Electronic Engineering

User-Defined Keywords

  • distributed machine learning (DML)
  • multi-bottlenecks
  • RDMA
  • transport protocol
