TY - GEN
T1 - A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning
AU - Shi, Shaohuai
AU - Wang, Qiang
AU - Chu, Xiaowen
AU - Li, Bo
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - With huge amounts of training data, deep learning has achieved great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify the potential bottlenecks and overheads that could be further optimized. Finally, we make the data set of our experimental traces publicly available, which could be used to support simulation-based studies.
AB - With huge amounts of training data, deep learning has achieved great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a potential bottleneck for overall training performance. In this paper, we first propose a general directed acyclic graph (DAG) model to describe the distributed synchronous stochastic gradient descent (S-SGD) algorithm, which has been widely used in distributed deep learning frameworks. To understand the practical impact of data communication on training performance, we conduct extensive empirical studies on four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over multi-GPU and multi-node environments with different data communication techniques, including PCIe, NVLink, 10GbE, and InfiniBand. Through both analytical and experimental studies, we identify the potential bottlenecks and overheads that could be further optimized. Finally, we make the data set of our experimental traces publicly available, which could be used to support simulation-based studies.
KW - Deep Learning
KW - Directed Acyclic Graph
KW - Graphics Processing Units
KW - InfiniBand
KW - NVLink
KW - Stochastic Gradient Descent
UR - http://www.scopus.com/inward/record.url?scp=85063348673&partnerID=8YFLogxK
U2 - 10.1109/PADSW.2018.8644932
DO - 10.1109/PADSW.2018.8644932
M3 - Conference proceeding
AN - SCOPUS:85063348673
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 425
EP - 432
BT - Proceedings - 2018 IEEE 24th International Conference on Parallel and Distributed Systems, ICPADS 2018
PB - IEEE Computer Society
T2 - 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018
Y2 - 11 December 2018 through 13 December 2018
ER -