MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

25 June 2024
Cunchen Hu
Heyang Huang
Junhao Hu
Jiang Xu
Xusheng Chen
Tao Xie
Chenxi Wang
Sa Wang
Yungang Bao
Ninghui Sun
Yizhou Shan
arXiv: 2406.17565
Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token (TTFT).
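The abstract does not spell out how the prompt tree-based, locality-aware policy works. As a rough, hedged sketch of the general idea (an illustrative prefix trie over prompt token blocks mapping cached prefixes to serving instances; the class and method names such as GlobalPromptTree, insert, and best_instance are assumptions for illustration, not the MemServe or MemPool API), a scheduler could route each request to the instance holding the longest cached prefix:

```python
# Illustrative sketch only (not the actual MemServe/MemPool API): a global
# prompt tree that records which serving instance caches which prompt-prefix
# blocks, so a scheduler can route requests toward the longest cached prefix.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class TrieNode:
    # children keyed by a fixed-size block of prompt tokens
    children: Dict[Tuple[int, ...], "TrieNode"] = field(default_factory=dict)
    # serving instances known to hold the KV cache for this prefix
    instances: Set[str] = field(default_factory=set)


class GlobalPromptTree:
    """Hypothetical global index: prompt-prefix blocks -> caching instances."""

    def __init__(self, block_size: int = 16):
        self.root = TrieNode()
        self.block_size = block_size

    def _blocks(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        bs = self.block_size
        # only complete blocks participate in prefix matching
        return [tuple(tokens[i:i + bs])
                for i in range(0, len(tokens) - bs + 1, bs)]

    def insert(self, tokens: List[int], instance: str) -> None:
        """Record that `instance` now caches KV state for this prompt prefix."""
        node = self.root
        for block in self._blocks(tokens):
            node = node.children.setdefault(block, TrieNode())
            node.instances.add(instance)

    def best_instance(self, tokens: List[int]) -> Optional[str]:
        """Return an instance caching the longest prefix of `tokens`, if any."""
        node, best = self.root, None
        for block in self._blocks(tokens):
            node = node.children.get(block)
            if node is None or not node.instances:
                break
            best = next(iter(node.instances))  # deepest matched level wins
        return best


# Usage sketch: a request sharing a cached prefix is routed to that instance.
tree = GlobalPromptTree(block_size=4)
tree.insert([1, 2, 3, 4, 5, 6, 7, 8], instance="prefill-0")
print(tree.best_instance([1, 2, 3, 4, 9, 9, 9, 9]))  # -> "prefill-0"
```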

View on arXiv