HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location

Large language models (LLMs) have enabled a wide range of applications with distinct service-level objectives (SLOs), from latency-sensitive online tasks like interactive chatbots to throughput-oriented offline workloads like document summarization. The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving latency requirements. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor to estimate batch execution time and an SLO-aware profiler to quantify latency interference, and (2) SLO-aware offline scheduling policies that maximize serving throughput and prevent starvation, without compromising online serving latency. Our evaluation on production workloads shows that HyGen achieves up to 3.87x gains in overall throughput over online serving baselines and up to 5.84x gains in offline throughput over hybrid serving baselines, while strictly satisfying latency SLOs.
@article{sun2025_2501.14808,
  title={HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location},
  author={Ting Sun and Penghan Wang and Fan Lai},
  journal={arXiv preprint arXiv:2501.14808},
  year={2025}
}