Abstract
Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer Sequence Transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to its large model size (e.g., billions of parameters) and the tremendous amount of computation involved. Reducing the energy consumption of IAAS without violating the service-level agreement (SLA) has therefore become a practical challenge for service providers. In this work, we conduct a comprehensive study of the inference performance and energy efficiency of a Transformer model trained for a language translation service. First, we empirically characterize the essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs under diverse workload configurations. This detailed workload separation facilitates a thorough understanding of the Transformer model's inference process. Second, we build an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme, which improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of a 40% average latency loss. Our findings provide a full picture of Transformer inference and suggest that workload balancing and scheduling have great potential for offering energy-efficient Transformer inference services.
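The re-batching idea behind the Aligned scheme (grouping requests of the same or similar sequence length into one batch, so that short sequences do not pay the padding and decoding cost of long ones) can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: the request tuples, `bucket_width`, and `max_batch_size` names are all assumptions.

```python
# Illustrative sketch of length-aligned re-batching, following the paper's
# finding that batching same/similar-length sequences improves both
# throughput and energy efficiency. All names here are hypothetical.
from collections import defaultdict

def align_batches(requests, bucket_width=8, max_batch_size=32):
    """Group (request_id, sequence) pairs into batches of similar length.

    Sequences in one batch then need little padding, so fewer FLOPs
    (and joules) are wasted on pad tokens during inference.
    """
    buckets = defaultdict(list)
    for req_id, seq in requests:
        # Requests whose lengths fall in the same bucket are batched together.
        buckets[len(seq) // bucket_width].append((req_id, seq))

    batches = []
    for _, reqs in sorted(buckets.items()):
        # Split each bucket into batches no larger than the GPU-friendly cap.
        for i in range(0, len(reqs), max_batch_size):
            batches.append(reqs[i:i + max_batch_size])
    return batches

# Example: three short and two long requests end up in separate batches.
reqs = [(0, "hi"), (1, "ok"), (2, "yes"), (3, "a" * 60), (4, "b" * 58)]
print(align_batches(reqs))
```

Bucketing by length trades a little queueing delay for much less padding, which is exactly the throughput/energy-versus-latency trade-off the abstract quantifies.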
Original language | English |
---|---|
Title of host publication | Proceedings - IEEE Congress on Cybermatics |
Subtitle of host publication | 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020 |
Publisher | IEEE |
Pages | 323-331 |
Number of pages | 9 |
ISBN (Electronic) | 9781728176475 |
DOIs | 10.1109/iThings-GreenCom-CPSCom-SmartData-Cybermatics50389.2020.00067 |
Publication status | Published - Nov 2020 |
Event | 2020 IEEE Congress on Cybermatics: 13th IEEE International Conferences on Internet of Things, iThings 2020, 16th IEEE International Conference on Green Computing and Communications, GreenCom 2020, 13th IEEE International Conference on Cyber, Physical and Social Computing, CPSCom 2020 and 6th IEEE International Conference on Smart Data, SmartData 2020 - Rhodes Island, Greece Duration: 2 Nov 2020 → 6 Nov 2020 |
Publication series
Name | Proceedings - IEEE Congress on Cybermatics: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020 |
---|---|
Conference
Conference | 2020 IEEE Congress on Cybermatics: 13th IEEE International Conferences on Internet of Things, iThings 2020, 16th IEEE International Conference on Green Computing and Communications, GreenCom 2020, 13th IEEE International Conference on Cyber, Physical and Social Computing, CPSCom 2020 and 6th IEEE International Conference on Smart Data, SmartData 2020 |
---|---|
Country/Territory | Greece |
City | Rhodes Island |
Period | 2/11/20 → 6/11/20 |
Scopus Subject Areas
- Artificial Intelligence
- Computer Networks and Communications
- Hardware and Architecture
- Information Systems and Management
- Renewable Energy, Sustainability and the Environment
- Communication
User-Defined Keywords
- Batch Inference
- Cloud Service
- Energy Efficiency
- Graphics Processing Units
- Inference Scheduling
- Transformer Model
Cite this
Proceedings - IEEE Congress on Cybermatics: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020. IEEE, 2020. p. 323-331 9291633 (Proceedings - IEEE Congress on Cybermatics: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020).
Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review
TY - GEN
T1 - Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs
AU - Wang, Yuxin
AU - Wang, Qiang
AU - Chu, Xiaowen
N1 - Funding Information: The research was supported by the Hong Kong RGC GRF grant HKBU 12200418.
3) Decrease g when 1 ≤ g ≤ g_m and p = 0 on this g.
B. Experimental Results. We conduct comprehensive scheduling experiments to evaluate the performance of the Aligned scheme, repeating each test 5 times for 500 iterations in total. The average throughput, latency, and total energy efficiency (ratio) are calculated for comparison; the results are shown in Figure 5. In terms of latency, the Aligned scheme incurs only a 40% latency loss compared with Mixture, and an 11.3% lower latency loss than MPS + Re-Batching without scheduling. In terms of throughput, the Aligned scheme is overall 2.86× better than Mixture, and up to 2.98× better when the concurrence is 75%. Most importantly, the energy efficiency of the Aligned scheme is much higher than that of Mixture, at 2.73× on average. These results indicate that by applying the Re-Batching, MPS, and scheduling strategies, we can greatly improve the energy efficiency of the Transformer inference server and maintain higher throughput, at the cost of a small latency sacrifice.
VII. Future Work. Our study reveals two potential research directions. First, since long sequences can significantly cut down both throughput and energy efficiency, efficient parallel algorithms and implementations of Transformer inference on GPUs should be developed. Second, our experimental results can be used to design real-world energy-efficient and high-performance AI inference for GPU clusters by properly re-batching and scheduling the arriving requests. With a real-world input system instead of a simulation, we expect that more robust control rules for the scheduling system will be needed.
[Fig. 5: The effectiveness of the Aligned Scheduling Scheme (Align) vs. Mixture, Re-Batching, and Re-Batching + MPS in terms of (a) latency, (b) throughput, and (c) energy efficiency (all as ratios), at 25%, 50%, 75%, and 100% concurrence.]
VIII. Conclusion. This paper investigates how the inference latency, throughput, and energy consumption of the Transformer model are influenced by different factors, including batch size, sequence length, and the length distribution within one batch. We build a model to describe the Transformer's energy performance on different GPUs as affected by the above factors and MPS. Our observations suggest that GPU servers should batch sequences of the same or similar length to improve both inference throughput and energy efficiency. The Aligned Scheduling Scheme is also shown to substantially improve both the inference throughput and energy efficiency of the Transformer model. However, since the GPU implementation of the Transformer still parallelizes autoregressive decoding poorly, long sequences bring down both inference performance and energy efficiency; our research group will investigate this issue further in the future. The experimental results and findings of this paper can help develop high-performance and energy-efficient online batching and scheduling algorithms for sequence inference with the Transformer model on GPUs.
PY - 2020/11
Y1 - 2020/11
AB - Inference-as-a-service (IAAS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer Sequence Transduction model. However, the inference process of the Transformer model consumes a significant amount of energy due to its large model size (e.g., billions of parameters) and the tremendous amount of computation involved. Reducing the energy consumption of IAAS without violating the service-level agreement (SLA) has therefore become a practical challenge for service providers. In this work, we conduct a comprehensive study of the inference performance and energy efficiency of a Transformer model trained for a language translation service. First, we empirically characterize the essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs under diverse workload configurations. This detailed workload separation facilitates a thorough understanding of the Transformer model's inference process. Second, we build an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme, which improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of a 40% average latency loss. Our findings provide a full picture of Transformer inference and suggest that workload balancing and scheduling have great potential for offering energy-efficient Transformer inference services.
KW - Batch Inference
KW - Cloud Service
KW - Energy Efficiency
KW - Graphics Processing Units
KW - Inference Scheduling
KW - Transformer Model
UR - http://www.scopus.com/inward/record.url?scp=85099482577&partnerID=8YFLogxK
U2 - 10.1109/iThings-GreenCom-CPSCom-SmartData-Cybermatics50389.2020.00067
DO - 10.1109/iThings-GreenCom-CPSCom-SmartData-Cybermatics50389.2020.00067
M3 - Conference proceeding
AN - SCOPUS:85099482577
T3 - Proceedings - IEEE Congress on Cybermatics: 2020 IEEE International Conferences on Internet of Things, iThings 2020, IEEE Green Computing and Communications, GreenCom 2020, IEEE Cyber, Physical and Social Computing, CPSCom 2020 and IEEE Smart Data, SmartData 2020
SP - 323
EP - 331
BT - Proceedings - IEEE Congress on Cybermatics
PB - IEEE
T2 - 2020 IEEE Congress on Cybermatics: 13th IEEE International Conferences on Internet of Things, iThings 2020, 16th IEEE International Conference on Green Computing and Communications, GreenCom 2020, 13th IEEE International Conference on Cyber, Physical and Social Computing, CPSCom 2020 and 6th IEEE International Conference on Smart Data, SmartData 2020
Y2 - 2 November 2020 through 6 November 2020
ER -
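For readers who want to reproduce the kind of energy measurements reported above, GPU power on NVIDIA hardware is typically read via NVML, which the paper's measurement setup relies on. Below is a minimal, hypothetical sampling loop using the real `pynvml` bindings (`nvmlDeviceGetPowerUsage` returns milliwatts); the `measure_energy_joules` wrapper, its parameters, and the trapezoidal integration are illustrative assumptions, not the authors' measurement harness.

```python
# Hypothetical GPU energy measurement via NVML. The pynvml calls below are
# real library functions; the sampling scheme itself is an assumption.
import threading
import time
import pynvml

def measure_energy_joules(run_inference, device_index=0, interval_s=0.05):
    """Integrate GPU power over the runtime of `run_inference` (a callable)."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        samples = []   # (timestamp, watts)
        done = False

        def sampler():
            # Poll instantaneous board power until the workload finishes.
            while not done:
                watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
                samples.append((time.time(), watts))
                time.sleep(interval_s)

        t = threading.Thread(target=sampler)
        t.start()
        run_inference()
        done = True
        t.join()

        # Trapezoidal integration of power over time yields energy in joules.
        return sum(
            0.5 * (w0 + w1) * (t1 - t0)
            for (t0, w0), (t1, w1) in zip(samples, samples[1:])
        )
    finally:
        pynvml.nvmlShutdown()
```

Dividing throughput (translated sequences per second) by the measured average power gives the sequences-per-joule style energy-efficiency ratio that the record's Figure 5 notes compare across scheduling strategies.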