ResearchTrend.AI

Accelerating Retrieval-Augmented Language Model Serving with Speculation
arXiv:2401.14021 · 25 January 2024
Zhihao Zhang, Alan Zhu, Lijie Yang, Yihua Xu, Lanting Li, P. Phothilimthana, Zhihao Jia
Tags: RALM, KELM
Links: ArXiv (abs) · PDF · HTML · GitHub

Papers citing "Accelerating Retrieval-Augmented Language Model Serving with Speculation"

Showing 15 of 15 papers
Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge
Qi Luo, X. Li, Junqi Dai, Shuang Cheng, Xipeng Qiu
RALM · 400 · 1 · 0 · 01 Nov 2025

FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning
Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan
LRM · 258 · 0 · 0 · 29 Aug 2025

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
KELM, LRM · 480 · 7 · 0 · 23 May 2025

Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella
283 · 0 · 0 · 01 May 2025

Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, Junlin Li, Yixin Ji, Zhiyong Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei
LLMAG · 522 · 13 · 0 · 28 Apr 2025

Tutorial Proposal: Speculative Decoding for Efficient LLM Inference
Heming Xia, Cunxiao Du, Yongqian Li, Qian Liu, Wenjie Li
395 · 4 · 0 · 01 Mar 2025

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, ..., Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
3DV, VLM · 1.1K · 10 · 0 · 28 Feb 2025

DReSD: Dense Retrieval for Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Milan Gritta, Huiyin Xue, Gerasimos Lampouras
RALM · 610 · 2 · 0 · 21 Feb 2025

Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs
Mohammad Reza Rezaei, Adji Bousso Dieng
VLM · 616 · 16 · 0 · 16 Feb 2025

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
AI4CE · 545 · 7 · 0 · 18 Dec 2024

AT-RAG: An Adaptive RAG Model Enhancing Query Efficiency with Topic Filtering and Iterative Reasoning
Mohammad Reza Rezaei, Maziar Hafezi, Amit Satpathy, Lovell Hodge, Ebrahim Pourjafari
248 · 10 · 0 · 16 Oct 2024

RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
496 · 99 · 0 · 18 Apr 2024

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Chak Tou Leong, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
LRM · 564 · 240 · 0 · 15 Jan 2024

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Yao-Min Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, Jinjie Gu
391 · 37 · 0 · 20 Dec 2023

Billion-scale similarity search with GPUs
IEEE Transactions on Big Data (TBD), 2017
Jeff Johnson, Matthijs Douze, Edouard Grave
1.3K · 4,864 · 0 · 28 Feb 2017