BAT: Learning to Reason about Spatial Sounds with Large Language Models

2 February 2024

Papers citing "BAT: Learning to Reason about Spatial Sounds with Large Language Models"

50 / 58 papers shown

AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs

467

17 Nov 2025

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

...

855

29 Oct 2025

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

...

563

28 Oct 2025

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

...

303

12 Oct 2025

Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty

138

07 Oct 2025

OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

Subrata Biswas

Mohammad Nur Hossain Khan

Bashima Islam

VLM LRM

166

30 Sep 2025

Spatial Audio Motion Understanding and Reasoning

A. Sridhar

Yinyi Guo

Erik M. Visser

103

18 Sep 2025

DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

Kevin Wilkinghoff

Zheng-Hua Tan

194

17 Sep 2025

Deep Learning for Personalized Binaural Audio Reproduction

264

30 Aug 2025

ASAudio: A Survey of Advanced Spatial Audio Research

263

08 Aug 2025

SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing

241

22 Jul 2025

MOSPA: Human Motion Generation Driven by Spatial Audio

...

286

16 Jul 2025

video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

488

18 Jun 2025

Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

...

197

17 Jun 2025

GRAM: Spatial general-purpose audio representation models for real-world applications

Goksenin Yuksel

Marcel van Gerven

Kiki van der Heijden

423

01 Jun 2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

755

29 Apr 2025

Spatial Audio Processing with Large Language Model on Wearable Devices

Ayushi Mishra

Yang Bai

Priyadarshan Narayanasamy

Nakul Garg

Nirupam Roy

395

11 Apr 2025

Audio-Language Datasets of Scenes and Events: A Survey. IEEE Access, 2024

630

10 Jan 2025

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

Changli Tang

Yixuan Li

Yudong Yang

Jimin Zhuang

Guangzhi Sun

Wei Li

Tianhao Shen

Chao Zhang

316

09 Oct 2024

MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences. International Conference on Learning Representations (ICLR), 2024

609

03 Oct 2024

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Siyin Wang

Wenyi Yu

Yudong Yang

Changli Tang

Yixuan Li

...

Jun Zhang

Guangzhi Sun

Lu Lu

Yuxuan Wang

Chao Zhang

AuLLM LM&MA

440

25 Sep 2024

Computer Audition: From Task-Specific Machine Learning to Foundation Models

Andreas Triantafyllopoulos

475

22 Jul 2024

Can Large Language Models Understand Spatial Audio?

Changli Tang

Wenyi Yu

Guangzhi Sun

Xianzhao Chen

Tian Tan

...

Jun Zhang

Lu Lu

Zejun Ma

Yuxuan Wang

Chao Zhang

431

12 Jun 2024

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. Computer Vision and Pattern Recognition (CVPR), 2024

Dorsa Sadigh

414

714

22 Jan 2024

EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Xie Chen

377

07 Jan 2024

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu

Jin Xu

Xiaohuan Zhou

Qian Yang

Shiliang Zhang

Zhijie Yan

Chang Zhou

Jingren Zhou

AuLLM

449

700

14 Nov 2023

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang

Wenyi Yu

Guangzhi Sun

490

529

20 Oct 2023

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Zhihao Du

Jiaming Wang

Qian Chen

Yunfei Chu

Zhifu Gao

...

Wen Wang

Siqi Zheng

Chang Zhou

Zhijie Yan

Shiliang Zhang

LLMAG VLM AuLLM LM&MA

538

105

07 Oct 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models

Louis Martin

...

Sharan Narang

Sergey Edunov

12.4K

16,448

18 Jul 2023

Kosmos-2: Grounding Multimodal Large Language Models to the World. International Conference on Learning Representations (ICLR), 2023

585

1,151

26 Jun 2023

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. Neural Information Processing Systems (NeurIPS), 2023

Kazuki Shimada

Archontis Politis

Parthasaarathy Sudarsanam

...

327

15 Jun 2023

Pengi: An Audio Language Model for Audio Tasks. Neural Information Processing Systems (NeurIPS), 2023

528

268

19 May 2023

Listen, Think, and Understand. International Conference on Learning Representations (ICLR), 2023

834

241

18 May 2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

...

Conghui He

Yu Qiao

372

734

28 Apr 2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. AAAI Conference on Artificial Intelligence (AAAI), 2023

Rongjie Huang

Mingze Li

Dongchao Yang

Jiatong Shi

...

Zhou Zhao

285

376

25 Apr 2023

Visual Instruction Tuning. Neural Information Processing Systems (NeurIPS), 2023

1.4K

8,828

17 Apr 2023

LLaMA: Open and Efficient Foundation Language Models

...

20.2K

19,316

27 Feb 2023

BEATs: Audio Pre-Training with Acoustic Tokenizers. International Conference on Machine Learning (ICML), 2022

550

561

18 Dec 2022

Robust Speech Recognition via Large-Scale Weak Supervision. International Conference on Machine Learning (ICML), 2022

1.4K

6,745

06 Dec 2022

AudioLM: a Language Modeling Approach to Audio Generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2022

Olivier Pietquin

...

559

892

07 Sep 2022

Masked Autoencoders that Listen. Neural Information Processing Systems (NeurIPS), 2022

Po-Yao (Bernie) Huang

Christoph Feichtenhofer

673

424

13 Jul 2022

SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning. Neural Information Processing Systems (NeurIPS), 2022

402

124

16 Jun 2022

L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

324

21 Feb 2022

SRP-DNN: Learning Direct-Path Phase Difference for Multiple Moving Sound Source Localization. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Bing Yang

Hong Liu

Xiaofei Li

257

16 Feb 2022

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos. IEEE International Conference on Computer Vision (ICCV), 2021

367

124

11 Oct 2021

Learning Representations from Audio-Visual Spatial Alignment

269

143

03 Nov 2020

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

385

127

29 Oct 2020

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound. Computer Vision and Pattern Recognition (CVPR), 2020

256

11 Jun 2020

Conformer: Convolution-augmented Transformer for Speech Recognition

...

959

3,981

16 May 2020

The LOCATA Challenge: Acoustic Source Localization and Tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2019

297

162

03 Sep 2019