DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

28 January 2025


Papers citing "DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data"


MERaLiON-SER: Robust Speech Emotion Recognition Model for English and SEA Languages

...

Wong Heng Meng Jeremy

493

07 Nov 2025

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

602

29 Oct 2025

SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

...

220

19 Oct 2025

Investigating Safety Vulnerabilities of Large Audio-Language Models Under Speaker Emotional Variations

...

182

19 Oct 2025

TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics

...

215

30 Sep 2025

Dual Information Speech Language Models for Emotional Conversations

123

11 Aug 2025

Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

193

10 Aug 2025

SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

222

25 Jul 2025

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

358

25 Jul 2025

DIFFA: Large Language Diffusion Models Can Listen and Understand

...

294

24 Jul 2025

GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

...

338

24 Jul 2025

Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding

477

08 Jun 2025

Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

...

268

07 Jun 2025

AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

436

05 Jun 2025

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Chun-Yi Kuan

Hung-yi Lee

AuLLM

369

26 May 2025

Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models

351

25 May 2025

Towards Reliable Large Audio Language Model

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

281

25 May 2025

Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

488

23 May 2025

Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples

Chun-Yi Kuan

Hung-yi Lee

391

20 May 2025

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

630

19 May 2025

On The Landscape of Spoken Language Models: A Comprehensive Survey

475

106

11 Apr 2025

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

572

09 Apr 2025

Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Umberto Cappellazzo

Minsu Kim

Stavros Petridis

564

09 Mar 2025

Qwen2-Audio Technical Report

Yunfei Chu

Jin Xu

...

Chang Zhou

Jingren Zhou

AuLLM VLM

457

542

15 Jul 2024

Qwen2 Technical Report

Bowen Yu

...

Yuqiong Liu

Zeyu Cui

Zhenru Zhang

Zhifang Guo

Zhi-Wei Fan

OSLM VLM MU

765

2,002

15 Jul 2024

Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

Hung-yi Lee

328

13 Jul 2024

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Ke-Han Lu

Zhehuai Chen

Szu-Wei Fu

He Huang

Boris Ginsburg

Yu-Chiang Frank Wang

Hung-yi Lee

VLM AuLLM

296

27 Jun 2024

BLSP-Emo: Towards Empathetic Large Speech-Language Models

320

06 Jun 2024

SpeechVerse: A Large-scale Generalizable Audio Language Model

...

580

14 May 2024

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Shujie Hu

...

281

130

31 Mar 2024

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

Jin Xu

Yunfei Chu

...

Chang Zhou

Jingren Zhou

LM&MA AuLLM ALM

301

213

12 Feb 2024

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Xie Chen

428

303

23 Dec 2023

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu

Jin Xu

Xiaohuan Zhou

Qian Yang

Shiliang Zhang

Zhijie Yan

Chang Zhou

Jingren Zhou

AuLLM

449

700

14 Nov 2023

AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Junteng Jia

Ozlem Kalinli

326

12 Nov 2023

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang

Wenyi Yu

Guangzhi Sun

490

529

20 Oct 2023

Joint Audio and Speech Understanding

Automatic Speech Recognition & Understanding (ASRU), 2023

614

126

25 Sep 2023

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

...

Hung-yi Lee

418

18 Sep 2023

BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

460

02 Sep 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Neural Information Processing Systems (NeurIPS), 2023

Christopher D. Manning

Chelsea Finn

ALM

1.1K

8,135

29 May 2023

Listen, Think, and Understand

International Conference on Learning Representations (ICLR), 2023

834

241

18 May 2023

PaLM 2 Technical Report

...

963

1,475

17 May 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

AAAI Conference on Artificial Intelligence (AAAI), 2023

Rongjie Huang

Mingze Li

Dongchao Yang

Jiatong Shi

...

Zhou Zhao

285

376

25 Apr 2023

GPT-4 Technical Report

...

5.3K

23,506

15 Mar 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

International Conference on Machine Learning (ICML), 2023

Silvio Savarese

1.6K

7,784

30 Jan 2023

Robust Speech Recognition via Large-Scale Weak Supervision

International Conference on Machine Learning (ICML), 2022

1.4K

6,745

06 Dec 2022

PromptTTS: Controllable Text-to-Speech with Text Descriptions

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

233

175

22 Nov 2022

Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Automatic Speech Recognition & Understanding (ASRU), 2022

418

24 Oct 2022

DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

566

03 Jul 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Neural Information Processing Systems (NeurIPS), 2022

2.8K

17,183

28 Jan 2022

LoRA: Low-Rank Adaptation of Large Language Models

International Conference on Learning Representations (ICLR), 2021

OffRL AI4TS AI4CE ALM AIMat

1.9K

17,979

17 Jun 2021