Energy-efficient Training of Multiple Deep Learning Models on GPU Clusters

Project: Research project

Project Details


Training complex deep learning models is very time-consuming. High-performance GPU clusters have been widely deployed to accelerate training jobs, and many excellent research projects have recently improved the scalability of GPU clusters for a single training job. However, two important issues have not yet been well addressed: one concerns performance and the other operational cost. First, a GPU cluster is often shared by multiple users whose concurrent training jobs compete for hardware and network resources. Our preliminary research shows that if these resources are allocated poorly, the total time of multiple training jobs can become 2.83 times slower due to I/O and network contention, even though the same number of GPUs is used. Second, GPUs consume significantly more power than CPUs, and the resulting electricity costs dominate the overall operational cost of the cluster. It is therefore critical to reduce overall energy consumption without degrading performance. Although many energy conservation techniques, such as dynamic voltage and frequency scaling (DVFS) and resource allocation and task scheduling (RATS), have been proposed previously, the impact of DVFS on GPU performance and power is not fully understood because of the complicated GPU memory hierarchy. Developing accurate GPU performance and power models is also critical to solving the RATS problem.
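To illustrate why DVFS matters for energy, consider the classic CMOS approximation for dynamic power, P ≈ C·V²·f. The sketch below uses this textbook formula with invented capacitance and voltage/frequency values, not measurements from any real GPU; real DVFS behavior also depends on static power and the memory hierarchy, which is exactly what the proposed models aim to capture.

```python
# Minimal sketch of the textbook dynamic-power model under DVFS.
# The capacitance constant and the voltage/frequency pairs below are
# hypothetical illustration values, not real GPU measurements.

def dynamic_power(capacitance: float, voltage: float, freq_ghz: float) -> float:
    """Dynamic power in watts: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * freq_ghz

def energy(power_w: float, runtime_s: float) -> float:
    """Energy in joules: E = P * t."""
    return power_w * runtime_s

# Lowering frequency lets voltage drop too, so power falls superlinearly.
# For a memory-bound kernel whose runtime barely changes, energy can drop.
base = dynamic_power(capacitance=50.0, voltage=1.0, freq_ghz=1.5)     # 75.0 W
scaled = dynamic_power(capacitance=50.0, voltage=0.85, freq_ghz=1.2)  # 43.35 W
```

The same 100-second memory-bound kernel would then cost 7500 J at the base setting but roughly 4335 J at the scaled one, which is why per-kernel DVFS characterization is worth building a dataset for.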

To address these challenges, we propose to carry out the following research tasks. First, we will build an open dataset of the performance and power consumption of a wide range of GPU kernels under DVFS on contemporary GPUs. Second, based on this dataset, we will continue to develop efficient yet accurate GPU performance and power models for deep learning training jobs. Third, based on the performance and power models, we will tackle the RATS problem by designing energy-efficient algorithms. Finally, we will implement an open-source prototype of a job management and scheduling system for GPU clusters and evaluate our proposals through real-world experiments.
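As a rough sketch of how model predictions could feed a RATS-style algorithm, the toy scheduler below greedily places each job on the GPU that frees up earliest (longest-job-first list scheduling) and tallies energy from predicted power. The job attributes and greedy rule are illustrative assumptions, not the algorithms this project will design.

```python
# Hypothetical greedy sketch of energy-aware scheduling for RATS.
# Runtime and power values would come from the performance/power models;
# everything here is an illustrative placeholder.
from dataclasses import dataclass
import heapq

@dataclass
class Job:
    name: str
    runtime_s: float  # predicted runtime (performance model)
    power_w: float    # predicted power (power model)

def schedule(jobs: list[Job], num_gpus: int) -> tuple[float, float]:
    """Longest-job-first list scheduling onto the earliest-free GPU.
    Returns (makespan in seconds, total energy in joules)."""
    gpus = [0.0] * num_gpus      # min-heap of per-GPU finish times
    heapq.heapify(gpus)
    total_energy = 0.0
    for job in sorted(jobs, key=lambda j: -j.runtime_s):
        start = heapq.heappop(gpus)          # earliest-free GPU
        heapq.heappush(gpus, start + job.runtime_s)
        total_energy += job.power_w * job.runtime_s
    return max(gpus), total_energy
```

A real RATS solution would additionally model I/O and network contention between co-located jobs and pick a DVFS setting per job, which this sketch deliberately omits.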

We believe our new GPU performance and power models will be of interest to the general GPU computing community, and that our RATS algorithms and open-source system will make a significant contribution to the deep learning community.
Effective start/end date: 1/09/18 to 31/08/21

