Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

6 March 2025
Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
Abstract

Incorporating external knowledge in large language models (LLMs) enhances their utility across diverse applications, but existing methods have trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG with a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
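
The abstract describes the recipe only at a high level. As a rough illustration, below is a minimal sketch of task-aware KV cache compression in the spirit of attention-guided cache eviction, assuming a Hugging Face transformers causal LM. The model choice, the task_hint string, the attention-averaging heuristic, and the fixed 30x keep ratio are all illustrative assumptions, not the authors' published algorithm.

# Sketch (illustrative, not the paper's exact method): prefill the
# external knowledge plus a short task description, score every cached
# position by the attention the task tokens pay it, and keep only the
# top-scoring keys/values (~30x compression here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM with KV caching works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation="eager"  # eager attention exposes weights
)
model.eval()

knowledge = "External documents the model should reason over ..."
task_hint = "Task: answer multi-hop questions about these documents."

# 1) Prefill once: populate the KV cache and record attention maps.
ids = tok(knowledge + "\n" + task_hint, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, use_cache=True, output_attentions=True)

# 2) Task-aware scores: attention from the task-hint tokens (the last
#    n_hint positions; boundary tokenization is approximate) to every
#    cached position, averaged over layers, heads, and hint tokens.
n_hint = tok("\n" + task_hint, return_tensors="pt").input_ids.shape[1]
attn = torch.stack(out.attentions)          # (layers, batch, heads, q, k)
scores = attn[..., -n_hint:, :].mean(dim=(0, 1, 2, 3))   # (k,)

# 3) Keep the top 1/30 of positions (~30x compression), in original order.
keep = scores.topk(max(1, ids.shape[1] // 30)).indices.sort().values

# 4) Prune each layer's key/value tensors to the kept positions; this
#    compact cache is what the LLM decodes against at answer time.
compressed = [(k[:, :, keep, :], v[:, :, keep, :])
              for k, v in out.past_key_values]

Unlike RAG, nothing is retrieved per query in this sketch: the whole corpus is prefilled once and distilled into a small cache conditioned on the task description, which is what lets a single compressed cache serve broad-knowledge questions.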

View on arXiv: https://arxiv.org/abs/2503.04973
@article{corallo2025_2503.04973,
  title={Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning},
  author={Giulio Corallo and Orion Weller and Fabio Petroni and Paolo Papotti},
  journal={arXiv preprint arXiv:2503.04973},
  year={2025}
}