SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrieval Augmented Generation

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) and multimodal learning, with successful applications in text generation and speech synthesis that enable deeper understanding and generation of multimodal content. In the field of sound effects (SFX) generation, LLMs have been leveraged to orchestrate multiple models for audio synthesis. However, due to the scarcity of annotated datasets and the complexity of temporal modeling, current SFX generation techniques still fall short of high-fidelity audio. To address these limitations, this paper introduces a novel framework that integrates LLMs with existing sound effect databases, allowing audio to be retrieved, recombined, and synthesized according to user requirements. This approach enhances the diversity and quality of generated sound effects while eliminating the need for additional recording costs, offering a flexible and efficient solution for sound design and application.
@article{guo2025_2505.03244,
  title={SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrieval Augmented Generation},
  author={Yu-Ren Guo and Wen-Kai Tai},
  journal={arXiv preprint arXiv:2505.03244},
  year={2025}
}