Energy-efficient Online Scheduling of Transformer Inference Services on GPU Servers

Yuxin Wang, Qiang Wang, Xiaowen Chu*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

1 Citation (Scopus)


Cloud service providers are deploying Transformer-based deep learning models on GPU servers to support many online inference-as-a-service (IAAS) applications, given the dominant performance of Transformers in natural language processing (NLP) tasks. However, Transformers' inherently high complexity and large model size (e.g., billions to hundreds of billions of parameters) tax resource-constrained GPU servers. Improving the energy efficiency and payload capability of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. This work conducts a comprehensive study of the inference performance and energy efficiency of Transformer models. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on NVIDIA GPUs under various workload configurations. Second, we establish a performance and energy consumption model for Transformers that facilitates energy-efficient scheduling policies. Finally, we propose an online batch inference scheduling scheme for Transformers on GPU servers, which we refer to as the Mixed Aligned Scheduling (MAS) scheme. Compared with existing scheduling schemes, MAS improves throughput and energy efficiency by up to 61.56% and 69.79%, respectively, on V100 GPU servers. Our findings give a full picture of the characteristics of Transformer inference on GPU servers across various input shapes and degrees of workload balance. We show that combining online batch inference with robust scheduling schemes can improve energy efficiency and overall inference performance under latency constraints.

Original language: English
Pages (from-to): 1649-1659
Number of pages: 11
Journal: IEEE Transactions on Green Communications and Networking
Issue number: 3
Early online date: 5 May 2022
Publication status: Published - Sept 2022

Scopus Subject Areas

  • Computer Networks and Communications
  • Renewable Energy, Sustainability and the Environment

User-Defined Keywords

  • Artificial intelligence
  • Cloud computing
  • Energy conservation
  • GPU computing
  • Transformer

