A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning

Shaohuai Shi, Qiang WANG, Xiaowen CHU, Bo Li

Research output: Chapter in book/report/conference proceeding > Conference proceeding > peer-review

20 Citations (Scopus)

Abstract

With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet and TensorFlow) over multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify the potential bottlenecks and overheads that could be further optimized. Finally, we make the data set of our experimental traces publicly available, which could be used to support simulation-based studies.

Original language: English
Title of host publication: Proceedings - 2018 IEEE 24th International Conference on Parallel and Distributed Systems, ICPADS 2018
Publisher: IEEE Computer Society
Pages: 425-432
Number of pages: 8
ISBN (Electronic): 9781538673089
Publication status: Published - 2 Jul 2018
Event: 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018 - Singapore, Singapore
Duration: 11 Dec 2018 to 13 Dec 2018

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2018-December
ISSN (Print): 1521-9097

Conference

Conference: 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018
Country/Territory: Singapore
City: Singapore
Period: 11/12/18 to 13/12/18

Scopus Subject Areas

  • Hardware and Architecture

User-Defined Keywords

  • Deep Learning
  • Directed Acyclic Graph
  • Graphics Processing Units
  • InfiniBand
  • NVLink
  • Stochastic Gradient Descent
