AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

Ying Wang
Zhen Jin
Jiexiong Xu
Wenhai Lin
Yiquan Chen
Wenzhi Chen
Main: 9 pages · 16 figures · 9 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving the efficiency of augmented LLM inference serving and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize the number of requests handled within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays that exceed the SLOs of many requests; and (ii) a static batch token limit fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality.
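The head-of-line blocking problem described above can be illustrated with a minimal scheduling sketch. The simulation below is purely illustrative and not from the paper: it compares FCFS against a shortest-job-first policy (one simple stand-in for SLO-aware scheduling) on a single server, counting how many requests finish within their latency SLO, i.e., the effective throughput. All request parameters here are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float  # arrival time
    service: float  # execution time once scheduled
    slo: float      # end-to-end latency budget

def effective_throughput(requests, policy):
    """Serve requests one at a time under the given queue-ordering policy;
    return how many finish within their SLO."""
    pending = sorted(requests, key=lambda r: r.arrival)
    clock, met, i, queue = 0.0, 0, 0, []
    while i < len(pending) or queue:
        # Admit every request that has arrived by the current time.
        while i < len(pending) and pending[i].arrival <= clock:
            queue.append(pending[i])
            i += 1
        if not queue:                     # idle until the next arrival
            clock = pending[i].arrival
            continue
        queue.sort(key=policy)            # the scheduling policy
        r = queue.pop(0)
        clock = max(clock, r.arrival) + r.service
        if clock - r.arrival <= r.slo:
            met += 1
    return met

# One long request queued ahead of two short, latency-sensitive ones.
reqs = [Request(0.0, 10.0, 20.0),
        Request(0.0, 1.0, 2.0),
        Request(0.0, 1.0, 2.0)]

fcfs = effective_throughput(reqs, policy=lambda r: r.arrival)  # long runs first
sjf = effective_throughput(reqs, policy=lambda r: r.service)   # short jobs first
print(fcfs, sjf)  # FCFS meets only 1 SLO; SJF meets all 3
```

Under FCFS, the long request blocks the queue: both short requests miss their 2-unit SLOs, so only one of three requests counts toward effective throughput. Reordering by job size lets all three meet their SLOs, which is the kind of gap an SLO-aware scheduler aims to exploit.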
