Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance

15 April 2025
Shangyu Liu
Zhenzhe Zheng
Xiaoyao Huang
Fan Wu
Guihai Chen
Jie Wu
Abstract

Small language models (SLMs) support efficient deployment on resource-constrained edge devices, but their limited capacity compromises inference performance. Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining. However, large-scale public databases and user-specific private contextual documents are typically located separately on the cloud and the device, while existing RAG implementations are primarily centralized. To bridge this gap, we propose DRAGON, a distributed RAG framework that enhances on-device SLMs with both general and personal knowledge, without the risk of leaking private documents. Specifically, DRAGON decomposes multi-document RAG into multiple parallel token generation processes performed independently and locally on the cloud and the device, and employs a newly designed Speculative Aggregation, a dual-side speculative algorithm that avoids frequent output synchronization between the cloud and the device. A new scheduling algorithm is further introduced to identify the optimal aggregation side based on real-time network conditions. Evaluations on a real-world hardware testbed demonstrate that DRAGON achieves up to 1.9x greater performance gains over the standalone SLM than centralized RAG, a substantial reduction in per-token latency, and negligible Time-to-First-Token (TTFT) overhead.
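The decomposition described in the abstract lends itself to a simple mental model: each side runs its own retrieval-conditioned decoding loop over its local documents, and only the per-token output distributions need to be combined. The Python sketch below is purely illustrative and rests on assumptions not stated in the abstract: toy random distributions stand in for real SLM forward passes, a weighted mixture is used as the aggregation rule, drafting is greedy, and names such as LocalDecoder and speculative_decode are hypothetical. The actual Speculative Aggregation algorithm and the network-aware scheduling of the aggregation side are defined in the paper itself and are not reproduced here.

# Minimal sketch of distributed, output-level RAG aggregation with
# per-side speculative drafts. Hypothetical names; not the DRAGON code.
import numpy as np

VOCAB = 50  # toy vocabulary size

class LocalDecoder:
    """One side (cloud or device): an SLM conditioned on locally retrieved docs."""
    def __init__(self, name, seed):
        self.name = name
        self.rng = np.random.default_rng(seed)

    def next_token_dist(self, prefix):
        # Placeholder for a real model forward pass over (query + local docs + prefix).
        logits = self.rng.normal(size=VOCAB) - 0.01 * len(prefix)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

def aggregate(dists, weights):
    """Weighted mixture of per-side next-token distributions (one simple choice)."""
    mix = sum(w * d for w, d in zip(weights, dists))
    return mix / mix.sum()

def speculative_decode(sides, weights, max_tokens=8):
    """Each side drafts its own greedy token; synchronization is needed only
    when the aggregated choice disagrees with a side's local draft."""
    prefix, mismatches = [], 0
    for _ in range(max_tokens):
        dists = [s.next_token_dist(prefix) for s in sides]
        drafts = [int(np.argmax(d)) for d in dists]            # local speculative tokens
        accepted = int(np.argmax(aggregate(dists, weights)))   # aggregated decision
        mismatches += sum(draft != accepted for draft in drafts)
        prefix.append(accepted)
    return prefix, mismatches

if __name__ == "__main__":
    cloud = LocalDecoder("cloud (public docs)", seed=0)
    device = LocalDecoder("device (private docs)", seed=1)
    tokens, mismatches = speculative_decode([cloud, device], weights=[0.5, 0.5])
    print("generated token ids:", tokens)
    print("draft/aggregate mismatches (would trigger sync):", mismatches)

In this toy setup, a low mismatch count means each side can keep decoding on its own draft most of the time, which is the intuition behind avoiding frequent output synchronization; which side performs the aggregation would, per the abstract, be chosen by a scheduler based on real-time network conditions.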

View on arXiv
@article{liu2025_2504.11197,
  title={Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance},
  author={Shangyu Liu and Zhenzhe Zheng and Xiaoyao Huang and Fan Wu and Guihai Chen and Jie Wu},
  journal={arXiv preprint arXiv:2504.11197},
  year={2025}
}