Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

24 February 2025
Weiqiao Shan
Yuang Li
Yuhao Zhang
Yingfeng Luo
Chen Xu
Xiaofeng Zhao
Long Meng
Yunfei Lu
Min Zhang
Hao Yang
Tong Xiao
Jingbo Zhu
Abstract

Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
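
The abstract does not spell out how PaM combines encoder outputs, so the following is only a minimal sketch of the general idea of prompt-conditioned feature fusion, next to the averaging baseline the abstract mentions. Everything in it is an assumption made for illustration: the softmax gate, the `gate_matrix` parameters, the two toy encoders ("semantic" and "acoustic"), and the hand-picked prompt embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]  # frames x feature_dim


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(prompt_embedding: Vector, gate_matrix: Matrix) -> List[float]:
    """Hypothetical gate: one score per encoder (dot product of a gate row
    with the prompt embedding), normalised with softmax."""
    scores = [
        sum(w * p for w, p in zip(row, prompt_embedding))
        for row in gate_matrix
    ]
    return softmax(scores)


def fuse(encoder_features: List[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of time-aligned feature sequences, one per encoder."""
    n_frames = len(encoder_features[0])
    dim = len(encoder_features[0][0])
    fused = [[0.0] * dim for _ in range(n_frames)]
    for feats, w in zip(encoder_features, weights):
        for t in range(n_frames):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused


def average_fusion(encoder_features: List[Matrix]) -> Matrix:
    """Baseline: every encoder gets the same weight, whatever the prompt."""
    n = len(encoder_features)
    return fuse(encoder_features, [1.0 / n] * n)


# Toy data (assumed): two encoders, two frames, two feature dimensions.
semantic = [[1.0, 0.0], [1.0, 0.0]]
acoustic = [[0.0, 1.0], [0.0, 1.0]]
features = [semantic, acoustic]

# Assumed prompt embeddings and gate parameters, chosen by hand.
asr_prompt = [1.0, 0.0]
caption_prompt = [0.0, 1.0]
gate_matrix = [[2.0, -2.0], [-2.0, 2.0]]

asr_weights = gate_weights(asr_prompt, gate_matrix)
caption_weights = gate_weights(caption_prompt, gate_matrix)
asr_fused = fuse(features, asr_weights)
caption_fused = fuse(features, caption_weights)
averaged = average_fusion(features)
```

With these toy values the gate puts most of the weight on the first encoder for the ASR-style prompt and on the second for the captioning-style prompt, while averaging yields the same mixture for every prompt. In a trained system the gate parameters would be learned rather than set by hand.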

@article{shan2025_2502.15178,
  title={Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders},
  author={Weiqiao Shan and Yuang Li and Yuhao Zhang and Yingfeng Luo and Chen Xu and Xiaofeng Zhao and Long Meng and Yunfei Lu and Min Zhang and Hao Yang and Tong Xiao and Jingbo Zhu},
  journal={arXiv preprint arXiv:2502.15178},
  year={2025}
}