Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs

Yuxin Wang, Qiang Wang, Xiaowen Chu

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

3 Citations (Scopus)

Abstract

Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer sequence transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to its large model size (e.g., billions of parameters) and tremendous computation. How to reduce the energy consumption of IAAS without violating the service-level agreement (SLA) is thus a practical challenge for service providers. In this work, we conduct a comprehensive study of the inference performance and energy efficiency of a Transformer model trained for the language translation service. First, we empirically characterize the essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs under diversified workload configurations. This detailed workload separation facilitates a thorough understanding of the inference process of the Transformer model. Second, we provide an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme, which improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of a 40% average latency loss. Our findings provide a full scope of Transformer inference and suggest that workload balancing and scheduling have great potential to offer energy-efficient Transformer inference services.
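The abstract does not describe the concrete form of the paper's energy model or the Aligned scheduling scheme. As a purely illustrative sketch of the general trade-off it discusses (picking an inference batch size that maximizes energy efficiency without violating a latency SLA), the following toy example uses made-up latency and energy functions; all constants and function names here are assumptions, not taken from the paper:

```python
# Illustrative sketch only: the actual energy model and "Aligned" scheduler
# are defined in the paper, not in this abstract. All constants below are
# hypothetical, chosen only to demonstrate the batch-size trade-off.

def latency_ms(batch_size: int) -> float:
    """Assumed latency model: fixed kernel-launch overhead plus per-sample cost."""
    return 20.0 + 1.5 * batch_size  # hypothetical constants

def energy_j(batch_size: int) -> float:
    """Assumed energy model: static GPU power drawn over the batch latency
    plus a dynamic per-sample cost."""
    return 0.05 * latency_ms(batch_size) + 0.8 * batch_size  # hypothetical

def best_batch_size(sla_ms: float, candidates=range(1, 129)):
    """Pick the batch size with the lowest energy per sample under the SLA,
    or None if even the smallest batch violates the latency bound."""
    feasible = [b for b in candidates if latency_ms(b) <= sla_ms]
    return min(feasible, key=lambda b: energy_j(b) / b) if feasible else None

print(best_batch_size(sla_ms=100.0))  # → 53: largest feasible batch wins here
```

Under these assumed models, energy per sample decreases monotonically with batch size, so the scheduler saturates the SLA; a real GPU exhibits diminishing throughput returns at large batches, which is precisely what the paper's measured characterization captures.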

Original language: English
Title of host publication: Proceedings - IEEE Congress on Cybermatics
Subtitle of host publication: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020
Publisher: IEEE
Pages: 323-331
Number of pages: 9
ISBN (Electronic): 9781728176475
DOIs
Publication status: Published - Nov 2020
Event: 2020 IEEE Congress on Cybermatics: 13th IEEE International Conferences on Internet of Things, iThings 2020, 16th IEEE International Conference on Green Computing and Communications, GreenCom 2020, 13th IEEE International Conference on Cyber, Physical and Social Computing, CPSCom 2020 and 6th IEEE International Conference on Smart Data, SmartData 2020 - Rhodes Island, Greece
Duration: 2 Nov 2020 - 6 Nov 2020

Publication series

Name: Proceedings - IEEE Congress on Cybermatics: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020

Conference

Conference: 2020 IEEE Congress on Cybermatics: 13th IEEE International Conferences on Internet of Things, iThings 2020, 16th IEEE International Conference on Green Computing and Communications, GreenCom 2020, 13th IEEE International Conference on Cyber, Physical and Social Computing, CPSCom 2020 and 6th IEEE International Conference on Smart Data, SmartData 2020
Country/Territory: Greece
City: Rhodes Island
Period: 2/11/20 - 6/11/20

Scopus Subject Areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management
  • Renewable Energy, Sustainability and the Environment
  • Communication

User-Defined Keywords

  • Batch Inference
  • Cloud Service
  • Energy Efficiency
  • Graphics Processing Units
  • Inference Scheduling
  • Transformer Model
