VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG
Main: 13 pages, 19 figures; bibliography: 2 pages
Abstract
Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver accurate, context-aware responses. However, co-locating the vector retriever and the LLM on shared GPU infrastructure introduces significant challenges: vector search is memory- and I/O-intensive, while LLM inference demands high throughput and low latency. Naive resource sharing often leads to severe performance degradation, particularly under high request loads or large index sizes.
