VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG
Main: 13 pages, 19 figures; bibliography: 2 pages
Abstract
Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver accurate, context-aware responses. However, co-locating the vector retriever and the LLM on shared GPU infrastructure introduces significant challenges: vector search is memory- and I/O-intensive, while LLM inference demands high throughput and low latency. Naive resource sharing often leads to severe performance degradation, particularly under high request loads or large index sizes.
