TY - UNPB
T1 - Towards Efficient and Reliable LLM Serving
T2 - A Real-World Workload Study
AU - Wang, Yuxin
AU - Chen, Yuhan
AU - Li, Zeyu
AU - Tang, Zhenheng
AU - Guo, Rui
AU - Wang, Xin
AU - Wang, Qiang
AU - Zhou, Amelie Chi
AU - Chu, Xiaowen
N1 - Funding information:
This work was done while Yuxin Wang was a research assistant at The Hong Kong University of Science and Technology (Guangzhou). This work was partially supported by the National Natural Science Foundation of China under Grant No. 62272122, a Hong Kong RIF grant under Grant No. R6021-20, and Hong Kong CRF grants under Grant No. C2004-21G and C7004-22G.
PY - 2024/01/31
Y1 - 2024/01/31
N2 - Large language models (LLMs), especially Generative Pretrained Transformer (GPT) models, have advanced significantly in the industry in recent years. However, the broader development of these models faces considerable challenges due to high operational and deployment costs, which has spurred active research into improving the hardware efficiency of LLMs. Yet the characteristics of real-world LLM workloads are often overlooked in current optimizations of LLM serving systems, and the absence of reliable workload data for evaluating these systems impacts quality of service (QoS) and reliability in industrial deployments. This paper introduces the first real-world trace dataset of LLM serving workloads, detailing user, system, and LLM behaviors. We analyze this trace, highlighting burstiness and the distributions of requests and responses, with a focus on the reliability of GPT services. Based on this analysis, we have developed a benchmark suite that reflects our dataset's workload patterns, enabling performance evaluation of serving systems. The suite captures the core patterns of the workload distributions, allowing the workload dataset to be scaled precisely to match system sizes. Our evaluation uncovers a previously unrecognized vulnerability of LLM serving systems to short-term burstiness, particularly in common workload scenarios: GPU memory limitations, caused by the fluctuating nature of burstiness, lead to significant performance degradation in existing LLM serving systems. Beyond benchmarking, understanding these patterns is valuable for optimizing LLM workload management, enabling elastic adjustment of hardware resources to varying workloads. To encourage further research, we have made the dataset and benchmark suite publicly available at https://github.com/HPMLL/BurstGPT.
AB - Large language models (LLMs), especially Generative Pretrained Transformer (GPT) models, have advanced significantly in the industry in recent years. However, the broader development of these models faces considerable challenges due to high operational and deployment costs, which has spurred active research into improving the hardware efficiency of LLMs. Yet the characteristics of real-world LLM workloads are often overlooked in current optimizations of LLM serving systems, and the absence of reliable workload data for evaluating these systems impacts quality of service (QoS) and reliability in industrial deployments. This paper introduces the first real-world trace dataset of LLM serving workloads, detailing user, system, and LLM behaviors. We analyze this trace, highlighting burstiness and the distributions of requests and responses, with a focus on the reliability of GPT services. Based on this analysis, we have developed a benchmark suite that reflects our dataset's workload patterns, enabling performance evaluation of serving systems. The suite captures the core patterns of the workload distributions, allowing the workload dataset to be scaled precisely to match system sizes. Our evaluation uncovers a previously unrecognized vulnerability of LLM serving systems to short-term burstiness, particularly in common workload scenarios: GPU memory limitations, caused by the fluctuating nature of burstiness, lead to significant performance degradation in existing LLM serving systems. Beyond benchmarking, understanding these patterns is valuable for optimizing LLM workload management, enabling elastic adjustment of hardware resources to varying workloads. To encourage further research, we have made the dataset and benchmark suite publicly available at https://github.com/HPMLL/BurstGPT.
KW - Large Language Models
KW - Generative Pretrained Transformer
KW - Batch Inference
KW - GPU Serving
KW - Bursty Workloads
KW - Benchmarking
KW - Quality of Service
U2 - 10.48550/arXiv.2401.17644
DO - 10.48550/arXiv.2401.17644
M3 - Preprint
T3 - arXiv
SP - 1
EP - 12
BT - Towards Efficient and Reliable LLM Serving
PB - Cornell University
ER -