Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

13 September 2024

Lingwei Meng

Shujie Hu

Jiawen Kang

Zhaoqing Li

Yuejiao Wang

Wenxuan Wu

Xixin Wu

Xunying Liu

Helen Meng

AuLLM

Papers citing "Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions"

ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

111

09 Nov 2025

M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models

219

22 Oct 2025

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition

19 Sep 2025

GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

224

16 Sep 2025

UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

124

16 Sep 2025

PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition

133

16 Sep 2025

Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition

102

01 Sep 2025

Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR

202

27 Jun 2025

Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction

278

11 Jun 2025

Towards Reliable Large Audio Language Model

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

237

25 May 2025

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He

Jacob Whitehill

215

16 May 2025

WorldSimBench: Towards Video Generation Models as World Simulators

Xijun Wang

...

Wanli Ouyang

550

796

23 Oct 2024

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC

Jiawen Kang

Lingwei Meng

Mingyu Cui

Yuejiao Wang

Xixin Wu

Xunying Liu

Helen Meng

274

19 Sep 2024

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Shinji Watanabe

223

23 Jul 2024

Qwen2-Audio Technical Report

Yunfei Chu

Jin Xu

...

Chang Zhou

Jingren Zhou

AuLLM VLM

285

380

15 Jul 2024

Autoregressive Speech Synthesis without Vector Quantization

...

430

11 Jul 2024

Serialized Output Training by Learned Dominance

Ying Shi

142

04 Jul 2024

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Shujie Hu

...

218

106

31 Mar 2024

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Lingwei Meng

163

08 Jan 2024

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu

Jin Xu

Xiaohuan Zhou

Qian Yang

Shiliang Zhang

Zhijie Yan

Chang Zhou

Jingren Zhou

AuLLM

320

595

14 Nov 2023

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang

Wenyi Yu

Guangzhi Sun

360

438

20 Oct 2023

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

International Conference on Learning Representations (ICLR), 2023

552

04 Oct 2023

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Yang Zhang

Krishna C. Puvvada

Vitaly Lavrukhin

Boris Ginsburg

164

09 Aug 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models

Louis Martin

...

Sharan Narang

Sergey Edunov

8.2K

15,302

18 Jul 2023

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Interspeech (Interspeech), 2023

187

109

06 Jul 2023

Kosmos-2: Grounding Multimodal Large Language Models to the World

International Conference on Learning Representations (ICLR), 2023

400

1,030

26 Jun 2023

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2023

Desh Raj

Daniel Povey

Sanjeev Khudanpur

VLM

334

18 Jun 2023

End-to-End Joint Target and Non-Target Speakers ASR

Interspeech (Interspeech), 2023

...

Atsushi Ando

110

04 Jun 2023

Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

Interspeech (Interspeech), 2023

Lingwei Meng

Haibin Wu

157

25 May 2023

GPT-4 Technical Report

...

4.6K

20,902

15 Mar 2023

Language Is Not All You Need: Aligning Perception with Language Models

Neural Information Processing Systems (NeurIPS), 2023

...

Xia Song

343

676

27 Feb 2023

LLaMA: Open and Efficient Foundation Language Models

...

6.4K

17,759

27 Feb 2023

A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Lingwei Meng

192

20 Feb 2023

Robust Speech Recognition via Large-Scale Weak Supervision

International Conference on Machine Learning (ICML), 2022

1.0K

5,793

06 Dec 2022

Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Zili Huang

Desh Raj

Leibny Paola García-Perera

Sanjeev Khudanpur

327

01 Nov 2022

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Interspeech (Interspeech), 2022

407

02 Feb 2022

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

...

Jian Wu

1.2K

2,674

26 Oct 2021

LoRA: Low-Rank Adaptation of Large Language Models

International Conference on Learning Representations (ICLR), 2021

OffRL AI4TS AI4CE ALM AIMat

1.6K

15,460

17 Jun 2021

End-to-End Speaker-Attributed ASR with Transformer

Interspeech (Interspeech), 2021

183

05 Apr 2021

Streaming Multi-talker Speech Recognition with Joint Speaker Identification

Interspeech (Interspeech), 2021

213

05 Apr 2021

Unsupervised Cross-lingual Representation Learning for Speech Recognition

Interspeech (Interspeech), 2020

368

919

24 Jun 2020

Serialized Output Training for End-to-End Overlapped Speech Recognition

Interspeech (Interspeech), 2020

228

144

28 Mar 2020

End-to-End Multi-speaker Speech Recognition with Transformer

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

Wangyou Zhang

280

112

10 Feb 2020

Parameter-Efficient Transfer Learning for NLP

International Conference on Machine Learning (ICML), 2019

N. Houlsby

A. Giurgiu

Stanislaw Jastrzebski

Bruna Morrone

Quentin de Laroussilhe

Andrea Gesmundo

Mona Attariyan

Sylvain Gelly

631

5,677

02 Feb 2019

Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016

348

917

01 Jul 2016