DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

3 July 2025
Ke-Han Lu
Zhehuai Chen
Szu-Wei Fu
Chao-Han Huck Yang
Sung-Feng Huang
Chih-Kai Yang
Chee-En Yu
Chun-Wei Chen
Wei-Chih Chen
Chien-yu Huang
Yi-Cheng Lin
Yu-Xiang Lin
Chi-An Fu
Chun-Yi Kuan
Wenze Ren
Xuanjun Chen
Wei-Ping Huang
En-Pei Hu
Tzu-Quan Lin
Yuan-Kuei Wu
Kuan-Po Huang
Hsiao-Ying Huang
Huang-Cheng Chou
Kai-Wei Chang
Cheng-Han Chiang
Boris Ginsburg
Yu-Chiang Frank Wang
Hung-yi Lee
Main: 10 pages · 3 figures · Bibliography: 4 pages
Abstract

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches often suffer from catastrophic forgetting of the LLM's original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM's native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
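The abstract's central mechanism is that the backbone LLM generates its own training targets from each audio clip's textual metadata, so the supervision stays in-distribution for the LLM. The sketch below is a minimal, hypothetical illustration of how such a self-generated target pipeline could look in Python; the model name, prompt format, metadata fields, and helper function are assumptions made for illustration, not the authors' actual implementation.

# Hypothetical sketch of the self-generated target construction described in the abstract:
# the backbone LLM answers an instruction about each audio clip's textual metadata
# (e.g., transcript, sound tags), and its own answer becomes the training target
# paired with the raw audio. Model name, prompt format, and fields are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed backbone; any instruct LLM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def self_generated_target(audio_metadata: str, instruction: str) -> str:
    """Let the backbone LLM produce the answer that will later supervise the LALM."""
    prompt = f"[Audio description] {audio_metadata}\n[Instruction] {instruction}\n[Answer]"
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    output_ids = llm.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens (the answer), not the prompt.
    return tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Each (audio, instruction, self-generated answer) triple becomes one training sample.
sample = {
    "audio_path": "clip_000001.wav",  # hypothetical file name
    "instruction": "Describe the speaker's emotion.",
    "target": self_generated_target(
        "Transcript: 'I can't believe we won!' (excited tone)",
        "Describe the speaker's emotion.",
    ),
}

Because the target is produced by the same backbone LLM that is later fine-tuned, the audio-text alignment objective does not pull the model away from its native language behavior, which is the property the abstract credits for avoiding catastrophic forgetting.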

View on arXiv
@article{lu2025_2507.02768,
  title={DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment},
  author={Ke-Han Lu and Zhehuai Chen and Szu-Wei Fu and Chao-Han Huck Yang and Sung-Feng Huang and Chih-Kai Yang and Chee-En Yu and Chun-Wei Chen and Wei-Chih Chen and Chien-yu Huang and Yi-Cheng Lin and Yu-Xiang Lin and Chi-An Fu and Chun-Yi Kuan and Wenze Ren and Xuanjun Chen and Wei-Ping Huang and En-Pei Hu and Tzu-Quan Lin and Yuan-Kuei Wu and Kuan-Po Huang and Hsiao-Ying Huang and Huang-Cheng Chou and Kai-Wei Chang and Cheng-Han Chiang and Boris Ginsburg and Yu-Chiang Frank Wang and Hung-yi Lee},
  journal={arXiv preprint arXiv:2507.02768},
  year={2025}
}