Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2305.11834
Cited By
Pengi: An Audio Language Model for Audio Tasks
19 May 2023
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Pengi: An Audio Language Model for Audio Tasks"
50 / 120 papers shown
Title
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
Jiaqi Wang
Hanqi Jiang
Yi-Hsueh Liu
Chong Ma
Xu-Yao Zhang
...
Xin Zhang
Wei Zhang
Dinggang Shen
Tianming Liu
Shu Zhang
VLM
AI4TS
39
18
0
02 Aug 2024
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh
Shuo Han
Hazim T. Bukhari
Benjamin Elizalde
Hannes Gamper
Rita Singh
Bhiksha Raj
ReLM
LRM
AuLLM
16
7
0
25 Jul 2024
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Tuomas Virtanen
Björn Schuller
39
4
0
22 Jul 2024
Qwen2-Audio Technical Report
Yunfei Chu
Jin Xu
Qian Yang
Haojie Wei
Xipin Wei
...
Yuanjun Lv
Jinzheng He
Junyang Lin
Chang Zhou
Jingren Zhou
AuLLM
VLM
29
100
0
15 Jul 2024
Factor-Conditioned Speaking-Style Captioning
Atsushi Ando
Takafumi Moriya
Shota Horiguchi
Ryo Masumura
22
0
0
27 Jun 2024
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
Ke-Han Lu
Zhehuai Chen
Szu-Wei Fu
He Huang
Boris Ginsburg
Yu-Chiang Frank Wang
Hung-yi Lee
VLM
AuLLM
23
9
0
27 Jun 2024
AudioBench: A Universal Benchmark for Audio Large Language Models
Bin Wang
Xunlong Zou
Geyu Lin
S.
Zhuohan Liu
Wenyu Zhang
Zhengyuan Liu
AiTi Aw
Nancy F. Chen
AuLLM
ELM
LM&MA
85
17
0
23 Jun 2024
Improving Text-To-Audio Models with Synthetic Captions
Zhifeng Kong
Sang-gil Lee
Deepanway Ghosal
Navonil Majumder
Ambuj Mehrish
Rafael Valle
Soujanya Poria
Bryan Catanzaro
42
3
0
18 Jun 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh
Sonal Kumar
Ashish Seth
Chandra Kiran Reddy Evuru
Utkarsh Tyagi
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
LRM
35
34
0
17 Jun 2024
Large Language Models for Dysfluency Detection in Stuttered Speech
Dominik Wagner
Sebastian P. Bayerl
Ilja Baumann
K. Riedhammer
Elmar Nöth
Tobias Bocklet
35
3
0
16 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
44
0
0
13 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
35
1
0
13 Jun 2024
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin
Hanjia Lyu
Xian Xu
Jiebo Luo
27
1
0
13 Jun 2024
Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
Chun-Yi Kuan
Wei-Ping Huang
Hung-yi Lee
AuLLM
24
1
0
12 Jun 2024
ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks
Xin Jing
Andreas Triantafyllopoulos
Björn Schuller
22
2
0
11 Jun 2024
Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Yanis Labrak
Adel Moumen
Richard Dufour
Mickael Rouvier
ELM
LM&MA
MedIm
23
0
0
09 Jun 2024
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban
Michael I. Mandel
Johanna Devaney
AuLLM
LRM
33
0
0
07 Jun 2024
AD-H: Autonomous Driving with Hierarchical Agents
Zaibin Zhang
Shiyu Tang
Yuanhang Zhang
Talas Fu
Yifan Wang
Yang Liu
Dong Wang
Jing Shao
Lijun Wang
H. Lu
42
3
0
05 Jun 2024
MaskSR: Masked Language Model for Full-band Speech Restoration
Xu Li
Qirui Wang
Xiaoyu Liu
20
8
0
04 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
20
5
0
04 Jun 2024
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty
Erfan Shayegani
Zikui Cai
Nael B. Abu-Ghazaleh
M. Salman Asif
Yue Dong
A. Roy-Chowdhury
Chengyu Song
31
15
0
27 May 2024
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Zhenwei Shao
Zhou Yu
Jun Yu
Xuecheng Ouyang
Lihao Zheng
Zhenbiao Gai
Mingyang Wang
Jiajun Ding
21
9
0
20 May 2024
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Raghuveer Peri
Sai Muralidhar Jayanthi
S. Ronanki
Anshu Bhatia
Karel Mundnich
...
Srikanth Vishnubhotla
Daniel Garcia-Romero
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
AAML
29
3
0
14 May 2024
SpeechVerse: A Large-scale Generalizable Audio Language Model
Nilaksh Das
Saket Dingliwal
S. Ronanki
Rohit Paturi
David Huang
...
Monica Sunkara
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
Katrin Kirchhoff
36
37
0
14 May 2024
Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion
Yuntao Shou
Tao Meng
Fuchen Zhang
Nan Yin
Keqin Li
Mamba
33
22
0
27 Apr 2024
Audio Dialogues: Dialogues dataset for audio and music understanding
Arushi Goel
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
AuLLM
24
4
0
11 Apr 2024
WavLLM: Towards Robust and Adaptive Speech Large Language Model
Shujie Hu
Long Zhou
Shujie Liu
Sanyuan Chen
Hongkun Hao
...
Xunying Liu
Jinyu Li
S. Sivasankaran
Linquan Liu
Furu Wei
AuLLM
21
42
0
31 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
40
50
0
27 Mar 2024
A Multimodal Approach to Device-Directed Speech Detection with Large Language Models
Dominik Wagner
Alexander W. Churchill
Siddharth Sigtia
Panayiotis Georgiou
Matt Mirsamadi
Aarshee Mishra
Erik Marchi
35
6
0
21 Mar 2024
EDTC: enhance depth of text comprehension in automated audio captioning
Liwen Tan
Yin Cao
Yi Zhou
27
0
0
27 Feb 2024
Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations
Guan-Ting Lin
Cheng-Han Chiang
Hung-yi Lee
23
22
0
20 Feb 2024
Model Composition for Multimodal Large Language Models
Chi Chen
Yiyang Du
Zheng Fang
Ziyue Wang
Fuwen Luo
...
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Janet Liu
MoMe
16
3
0
20 Feb 2024
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
35
11
0
19 Feb 2024
Domain Adaptation for Contrastive Audio-Language Models
Soham Deshmukh
Rita Singh
Bhiksha Raj
VLM
19
7
0
14 Feb 2024
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Hang Zhao
Yifei Xin
Zhesong Yu
Bilei Zhu
Lu Lu
Zejun Ma
AuLLM
23
4
0
12 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
33
11
0
10 Feb 2024
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
Chen Chen
Ruizhe Li
Yuchen Hu
Sabato Marco Siniscalchi
Pin-Yu Chen
Ensiong Chng
Chao-Han Huck Yang
24
19
0
08 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Wei Ping
Rafael Valle
Bryan Catanzaro
AuLLM
LM&MA
MLLM
59
73
0
02 Feb 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models
Zhisheng Zheng
Puyuan Peng
Ziyang Ma
Xie Chen
Eunsol Choi
David F. Harwath
LRM
28
13
0
02 Feb 2024
PAM: Prompting Audio-Language Models for Audio Quality Assessment
Soham Deshmukh
Dareen Alharthi
Benjamin Elizalde
Hannes Gamper
Mahmoud Al Ismail
Rita Singh
Bhiksha Raj
Huaming Wang
10
11
0
01 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP
VLM
11
7
0
31 Jan 2024
On the Audio Hallucinations in Large Audio-Video Language Models
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
17
5
0
18 Jan 2024
Learning Audio Concepts from Counterfactual Natural Language
A. Vosoughi
Luca Bondi
Ho-Hsiang Wu
Chenliang Xu
CML
37
2
0
10 Jan 2024
E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
Hongfei Xue
Yuhao Liang
Bingshen Mu
Shiliang Zhang
Mengzhe Chen
Qian Chen
Lei Xie
AuLLM
19
9
0
31 Dec 2023
Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
Yafei Hu
Quanting Xie
Vidhi Jain
Jonathan M Francis
Jay Patrikar
...
Xiaolong Wang
Sebastian A. Scherer
Z. Kira
Fei Xia
Yonatan Bisk
LM&Ro
AI4CE
24
54
0
14 Dec 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha
Wooyoung Kang
Jonghwan Mun
Byungseok Roh
MLLM
13
111
0
11 Dec 2023
Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
Dominik Wagner
Alexander W. Churchill
Siddharth Sigtia
Panayiotis Georgiou
Matt Mirsamadi
Aarshee Mishra
Erik Marchi
10
3
0
06 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq R. Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
25
45
0
30 Nov 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu
Jin Xu
Xiaohuan Zhou
Qian Yang
Shiliang Zhang
Zhijie Yan
Chang Zhou
Jingren Zhou
AuLLM
16
263
0
14 Nov 2023
Joint Music and Language Attention Models for Zero-shot Music Tagging
Xingjian Du
Zhesong Yu
Jiaju Lin
Bilei Zhu
Qiuqiang Kong
BDL
VLM
25
4
0
16 Oct 2023
Previous
1
2
3
Next