Towards Efficient and Reliable LLM Serving: A Real-World Workload Study

Yuxin Wang, Yuhan Chen, Zeyu Li, Zhenheng Tang, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu*

*Corresponding author for this work

Research output: Working paper › Preprint

Abstract

Large language models (LLMs), especially Generative Pretrained Transformer (GPT) models, have advanced significantly in industry in recent years. However, the broader development of these models faces considerable challenges due to high operational and deployment costs, which has led to active research on improving the hardware efficiency of LLMs. Yet the characteristics of real-world LLM workloads are often overlooked in current optimizations of LLM serving systems. Notably, the absence of reliable workload data for evaluating LLM serving systems impacts the quality of service (QoS) and reliability in industrial deployments. This paper introduces the first real-world trace dataset of LLM serving workloads, detailing user, system, and LLM behaviors. We analyze this trace, highlighting burstiness and request and response distributions, and focusing on the reliability of GPT services. Based on this, we have developed a benchmark suite that reflects our dataset's workload patterns, enabling performance evaluation of serving systems. This suite captures the core patterns of the workload distributions, allowing the workload dataset to be scaled precisely to match system sizes. Our evaluation uncovers a previously unrecognized vulnerability of LLM serving systems to short-term burstiness, particularly in common workload scenarios. We observe that GPU memory limitations, caused by the fluctuating nature of burstiness, lead to significant performance degradation in existing LLM serving systems. Beyond benchmarking, understanding these patterns is valuable for optimizing LLM workload management, enabling elastic hardware resource adjustments to varying workloads. To encourage further research, we have made the dataset and benchmark suite publicly available at https://github.com/HPMLL/BurstGPT.
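As a rough illustration of the kind of trace analysis described above, the sketch below loads the released trace with pandas and computes a standard burstiness coefficient over request inter-arrival times, along with request/response token distributions. The file name `BurstGPT_1.csv` and the column names `Timestamp`, `Request tokens`, and `Response tokens` are assumptions about the released CSV schema and should be checked against the repository; the burstiness statistic B = (sigma - mu)/(sigma + mu) is a generic measure, not necessarily the one used in the paper.

```python
import numpy as np
import pandas as pd

# Assumed file name from the BurstGPT repository; adjust to the actual release.
trace = pd.read_csv("BurstGPT_1.csv")

# Inter-arrival times between consecutive requests (assuming "Timestamp" is in seconds).
arrivals = trace["Timestamp"].sort_values().to_numpy()
inter_arrival = np.diff(arrivals)

# Burstiness coefficient B = (sigma - mu) / (sigma + mu) over inter-arrival times:
# roughly -1 for periodic arrivals, 0 for Poisson, approaching 1 for highly bursty traffic.
mu, sigma = inter_arrival.mean(), inter_arrival.std()
burstiness = (sigma - mu) / (sigma + mu)
print(f"Burstiness coefficient: {burstiness:.3f}")

# Request and response length distributions (token counts per request).
print(trace["Request tokens"].describe())
print(trace["Response tokens"].describe())
```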
Original language: English
Publisher: Cornell University
Pages: 1-12
Number of pages: 12
DOIs
Publication status: Published - 31 Jan 2024

Publication series

Name: arXiv
Publisher: Cornell University

User-Defined Keywords

  • Large Language Models
  • Generative Pretrained Transformer
  • Batch Inference
  • GPU Serving
  • Bursty Workloads
  • Benchmarking
  • Quality of Service
