TY - JOUR
T1 - Energy-Efficient Online Scheduling of Transformer Inference Services on GPU Servers
AU - Wang, Yuxin
AU - Wang, Qiang
AU - Chu, Xiaowen
N1 - Funding Information:
This work was supported by Hong Kong RGC GRF under Grant HKBU 12200418.
Publisher Copyright:
© 2022 IEEE.
PY - 2022/9
Y1 - 2022/9
N2 - Cloud service providers are deploying Transformer-based deep learning models on GPU servers to support many online inference-as-a-service (IAAS) applications, given the dominant performance of Transformers in natural language processing (NLP) tasks. However, the inherent high complexity and large model size of Transformers (e.g., billions to hundreds of billions of parameters) tax resource-constrained GPU servers. Improving the energy efficiency and payload capacity of IAAS without violating the service-level agreement (SLA) has therefore become a practical challenge for service providers. This work conducts a comprehensive study of the inference performance and energy efficiency of Transformer models. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on NVIDIA GPUs under various workload configurations. Second, we establish a performance and energy-consumption model for Transformers that facilitates energy-efficient scheduling policies. Finally, we propose an online batch inference scheduling scheme for Transformers on GPU servers, which we refer to as the Mixed Aligned Scheduling (MAS) scheme. Compared with existing scheduling schemes, MAS improves throughput and energy efficiency by up to 61.56% and 69.79%, respectively, on V100 GPU servers. Our findings expose the full scope of Transformer inference characteristics on GPU servers across various input shapes and degrees of workload balance. We show that combining online batch inference with robust scheduling schemes can improve both energy efficiency and overall inference performance under latency constraints.
KW - Artificial intelligence
KW - Cloud computing
KW - Energy conservation
KW - GPU computing
KW - Transformer
UR - https://www.scopus.com/pages/publications/85132514300
U2 - 10.1109/TGCN.2022.3171680
DO - 10.1109/TGCN.2022.3171680
M3 - Journal article
SN - 2473-2400
VL - 6
SP - 1649
EP - 1659
JO - IEEE Transactions on Green Communications and Networking
JF - IEEE Transactions on Green Communications and Networking
IS - 3
ER -