Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

19 April 2025
Hang Zhang
Jiuchen Shi
Yixiao Wang
Quan Chen
Yizhou Shan
Minyi Guo
Abstract

Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving metrics such as Time-To-First-Token (TTFT), because they neglect the usage dependencies between LoRAs and KV caches when caching them. We therefore propose FASTLIBRA, a Multi-LoRA caching system that optimizes serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average compared to state-of-the-art works.
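
The paper's implementation is not reproduced here; the following is a minimal sketch, under simplified assumptions, of how a unified caching pool might track the dependencies between LoRA adapters and the KV caches produced under them, and use a single cost model to pick swap-out victims when HBM is under pressure. All class names, sizes, and the cost heuristic are illustrative assumptions, not FASTLIBRA's actual design.

# Minimal sketch (not the paper's implementation): a unified HBM pool holding
# both LoRA adapters and KV caches, with a cost-model-driven swapper.
# All names, sizes, and the cost heuristic are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    key: str                  # e.g. "lora:finance" or "kv:req42"
    size_mb: float            # HBM footprint
    reuse_prob: float         # estimated probability of near-term reuse
    reload_ms: float          # cost to bring the entry back from host memory
    depends_on: set = field(default_factory=set)  # KV caches depend on their LoRA


class UnifiedCachePool:
    """Single pool managing LoRA adapters and KV caches together in HBM."""

    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.entries: dict[str, CacheEntry] = {}

    def used_mb(self) -> float:
        return sum(e.size_mb for e in self.entries.values())

    def eviction_cost(self, entry: CacheEntry) -> float:
        # Unified cost model (illustrative): expected penalty of evicting an
        # entry = reuse probability * reload cost. Evicting a LoRA also
        # invalidates KV caches that depend on it, so their penalty is added.
        dependents = [e for e in self.entries.values() if entry.key in e.depends_on]
        return entry.reuse_prob * entry.reload_ms + sum(
            d.reuse_prob * d.reload_ms for d in dependents
        )

    def swap_in(self, entry: CacheEntry) -> None:
        # Busy path: make room by evicting the cheapest-to-lose entries first.
        while self.used_mb() + entry.size_mb > self.capacity_mb and self.entries:
            victim = min(self.entries.values(), key=self.eviction_cost)
            self.swap_out(victim.key)
        self.entries[entry.key] = entry

    def swap_out(self, key: str) -> None:
        # Evict an entry together with any KV caches that depend on it.
        for dep_key in [k for k, e in self.entries.items() if key in e.depends_on]:
            self.entries.pop(dep_key, None)
        self.entries.pop(key, None)


if __name__ == "__main__":
    pool = UnifiedCachePool(capacity_mb=100)
    pool.swap_in(CacheEntry("lora:finance", 40, reuse_prob=0.9, reload_ms=20))
    pool.swap_in(CacheEntry("kv:req1", 50, reuse_prob=0.5, reload_ms=8,
                            depends_on={"lora:finance"}))
    pool.swap_in(CacheEntry("lora:legal", 40, reuse_prob=0.2, reload_ms=20))
    print(sorted(pool.entries))  # the lowest-penalty entry (the KV cache) was evicted

In this toy model, evicting a LoRA also drops the KV caches computed under it, which is why a dependency-aware cost model can avoid evicting a LoRA that looks cheap on its own but has expensive-to-recompute dependents.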

@article{zhang2025_2505.03756,
  title={Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management},
  author={Hang Zhang and Jiuchen Shi and Yixiao Wang and Quan Chen and Yizhou Shan and Minyi Guo},
  journal={arXiv preprint arXiv:2505.03756},
  year={2025}
}