2305.11834
Pengi: An Audio Language Model for Audio Tasks
19 May 2023
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
Papers citing
"Pengi: An Audio Language Model for Audio Tasks"
50 / 120 papers shown
Title
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Chao-Han Huck Yang
Sreyan Ghosh
Qing Wang
Jaeyeon Kim
Hengyi Hong
...
Dinesh Manocha
Gunhee Kim
Jun Du
Rafael Valle
Bryan Catanzaro
16
0
0
12 May 2025
CaReAQA: A Cardiac and Respiratory Audio Question Answering Model for Open-Ended Diagnostic Reasoning
Tsai-Ning Wang
Lin-Lin Chen
Neil Zeghidour
Aaqib Saeed
AuLLM
LM&MA
43
0
0
02 May 2025
A Survey of Interactive Generative Video
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
X. Wang
Pengfei Wan
Di Zhang
Kun Gai
Hao Chen
Xihui Liu
VGen
53
0
0
30 Apr 2025
Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
Hongfei Xue
Yufeng Tang
Hexin Liu
Jun Zhang
Xuelong Geng
Lei Xie
LRM
50
0
0
29 Apr 2025
Transformation of audio embeddings into interpretable, concept-based representations
Alice Zhang
Edison Thomaz
Lie Lu
17
0
0
18 Apr 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
26
0
0
28 Mar 2025
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
X. Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
86
12
0
26 Mar 2025
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Xiao Guo
Xiufeng Song
Yue Zhang
Xiaohong Liu
X. Liu
51
1
0
26 Mar 2025
Position: Interactive Generative Video as Next-Generation Game Engine
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xintao Wang
Pengfei Wan
Di Zhang
Xihui Liu
VGen
45
1
0
21 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
50
0
0
12 Mar 2025
Mellow: a small audio language model for reasoning
Soham Deshmukh
Satvik Dixit
Rita Singh
Bhiksha Raj
AuLLM
ReLM
LRM
69
1
0
11 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
30
0
0
01 Mar 2025
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li
J. Liu
Tao Zhang
Yuanbo Fang
Da Pan
...
Guosheng Dong
Jianhua Xu
Haoze Sun
Zenan Zhou
Weipeng Chen
AuLLM
45
3
0
24 Feb 2025
Soundwave: Less is More for Speech-Text Alignment in LLMs
Y. Zhang
Zhiheng Liu
Fan Bu
Ruiyu Zhang
Benyou Wang
H. Li
AuLLM
SyDa
VLM
98
0
0
18 Feb 2025
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
Mayank Vatsa
Aparna Bharati
S. Mittal
Richa Singh
53
0
0
10 Feb 2025
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Manh Luong
Khai Nguyen
Dinh Q. Phung
Gholamreza Haffari
Lizhen Qu
44
0
0
08 Feb 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
99
1
0
28 Jan 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
40
0
0
17 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
63
2
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
96
102
0
10 Jan 2025
"Yeah Right!" -- Do LLMs Exhibit Multimodal Feature Transfer?
Benjamin Z. Reichman
Kartik Talamadupula
38
0
0
07 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
53
3
0
03 Jan 2025
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
Chun-Yi Kuan
Hung-yi Lee
AuLLM
LRM
54
1
0
03 Jan 2025
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
Gongyu Chen
Haomin Zhang
Chaofan Ding
Zihao Chen
Xinhan Di
30
0
0
23 Dec 2024
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing
Juncheng Hu
Guotao Liang
Jing Zhang
Dong Xu
Qian Yu
78
7
0
15 Dec 2024
MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension
Zeyu Ling
Bo Han
Shiyang Li
H. Shen
Jikang Cheng
Changqing Zou
79
1
0
26 Nov 2024
State-Space Large Audio Language Models
Saurabhchand Bhati
Yuan Gong
Leonid Karlinsky
Hilde Kuehne
Rogerio Feris
James Glass
82
0
0
24 Nov 2024
MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Satvik Dixit
Soham Deshmukh
Bhiksha Raj
25
1
0
01 Nov 2024
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S. Sakshi
Utkarsh Tyagi
Sonal Kumar
Ashish Seth
Ramaneswaran Selvakumar
Oriol Nieto
R. Duraiswami
Sreyan Ghosh
Dinesh Manocha
AuLLM
ELM
65
19
0
24 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
29
2
0
23 Oct 2024
Generative AI Agents in Autonomous Machines: A Safety Perspective
Jason J. Jabbour
Vijay Janapa Reddi
AI4CE
33
3
0
20 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLM
VLM
57
3
0
20 Oct 2024
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu
Yuhao Zhang
X. Wang
Benyou Wang
Q. Liu
H. Li
LM&MA
ELM
AuLLM
30
1
0
17 Oct 2024
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard
Michel Olvera
Stéphane Lathuilière
S. Essid
VLM
17
0
0
08 Oct 2024
MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models
Kaichen Huang
Jiahao Huo
Yibo Yan
Kun Wang
Yutao Yue
Xuming Hu
25
2
0
07 Oct 2024
Distilling an End-to-End Voice Assistant Without Instruction Training Data
William B. Held
Ella Li
Michael Joseph Ryan
Weiyan Shi
Yanzhe Zhang
Diyi Yang
AuLLM
29
8
0
03 Oct 2024
PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif
M. Agro
Mohammad Areeb Qazi
Hanan Aldarmaki
VLM
16
1
0
29 Sep 2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Yiming Chen
Xianghu Yue
Xiaoxue Gao
Chen Zhang
L. F. D’Haro
R. Tan
Haizhou Li
AuLLM
22
0
0
27 Sep 2024
Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task
Jozef Coldenhoff
Milos Cernak
21
0
0
21 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
18
9
0
18 Sep 2024
Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
Cagri Gungor
Adriana Kovashka
EgoV
19
0
0
15 Sep 2024
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Manjie Xu
Chenxing Li
Xinyi Tu
Yong Ren
Ruibo Fu
Wei Liang
Dong Yu
DiffM
33
1
0
14 Sep 2024
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh
Sonal Kumar
Chandra Kiran Reddy Evuru
Oriol Nieto
R. Duraiswami
Dinesh Manocha
VLM
22
0
0
13 Sep 2024
TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
Beilong Tang
Bang Zeng
Ming Li
18
2
0
12 Sep 2024
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
A. Sridhar
Yinyi Guo
Erik M. Visser
AuLLM
22
0
0
10 Sep 2024
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
W. Zhang
Shuo Sun
Bin Wang
Xunlong Zou
Zhuohan Liu
Yingxu He
Geyu Lin
Nancy F. Chen
A. Aw
AuLLM
65
1
0
10 Sep 2024
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qingkai Fang
Shoutao Guo
Yan Zhou
Zhengrui Ma
Shaolei Zhang
Yang Feng
AuLLM
25
29
0
10 Sep 2024
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim
Minjeon Jeon
Jaeyoon Jung
Sang Hoon Woo
Jinjoo Lee
13
2
0
02 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
13
1
0
02 Sep 2024
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Yunwen Xia
Hui Fang
Emmanouil Benetos
Jie Zhang
Chong Long
Dmitry Bogdanov
AuLLM
38
1
0
02 Aug 2024