ResearchTrend.AI
Pengi: An Audio Language Model for Audio Tasks

19 May 2023
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
    MLLM
    AuLLM

Papers citing "Pengi: An Audio Language Model for Audio Tasks"

50 / 120 papers shown
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Chao-Han Huck Yang
Sreyan Ghosh
Qing Wang
Jaeyeon Kim
Hengyi Hong
...
Dinesh Manocha
Gunhee Kim
Jun Du
Rafael Valle
Bryan Catanzaro
20
0
0
12 May 2025
CaReAQA: A Cardiac and Respiratory Audio Question Answering Model for Open-Ended Diagnostic Reasoning
Tsai-Ning Wang
Lin-Lin Chen
Neil Zeghidour
Aaqib Saeed
AuLLM
LM&MA
45
0
0
02 May 2025
A Survey of Interactive Generative Video
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
X. Wang
Pengfei Wan
Di Zhang
Kun Gai
Hao Chen
Xihui Liu
VGen
53
0
0
30 Apr 2025
Enhancing Non-Core Language Instruction-Following in Speech LLMs via Semi-Implicit Cross-Lingual CoT Reasoning
Hongfei Xue
Yufeng Tang
Hexin Liu
Jun Zhang
Xuelong Geng
Lei Xie
LRM
50
0
0
29 Apr 2025
Transformation of audio embeddings into interpretable, concept-based representations
Alice Zhang
Edison Thomaz
Lie Lu
21
0
0
18 Apr 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
28
0
0
28 Mar 2025
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
X. Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
86
12
0
26 Mar 2025
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Xiao Guo
Xiufeng Song
Yue Zhang
Xiaohong Liu
X. Liu
53
1
0
26 Mar 2025
Position: Interactive Generative Video as Next-Generation Game Engine
Jiwen Yu
Yiran Qin
Haoxuan Che
Quande Liu
Xintao Wang
Pengfei Wan
Di Zhang
Xihui Liu
VGen
45
1
0
21 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
50
0
0
12 Mar 2025
Mellow: a small audio language model for reasoning
Soham Deshmukh
Satvik Dixit
Rita Singh
Bhiksha Raj
AuLLM
ReLM
LRM
75
1
0
11 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
30
0
0
01 Mar 2025
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Tianpeng Li
J. Liu
Tao Zhang
Yuanbo Fang
Da Pan
...
Guosheng Dong
Jianhua Xu
Haoze Sun
Zenan Zhou
Weipeng Chen
AuLLM
47
3
0
24 Feb 2025
Soundwave: Less is More for Speech-Text Alignment in LLMs
Y. Zhang
Zhiheng Liu
Fan Bu
Ruiyu Zhang
Benyou Wang
H. Li
AuLLM
SyDa
VLM
98
0
0
18 Feb 2025
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
Mayank Vatsa
Aparna Bharati
S. Mittal
Richa Singh
53
0
0
10 Feb 2025
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Manh Luong
Khai Nguyen
Dinh Q. Phung
Gholamreza Haffari
Lizhen Qu
47
0
0
08 Feb 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
99
1
0
28 Jan 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
40
0
0
17 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
70
2
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
101
102
0
10 Jan 2025
"Yeah Right!" -- Do LLMs Exhibit Multimodal Feature Transfer?
Benjamin Z. Reichman
Kartik Talamadupula
38
0
0
07 Jan 2025
Instruction-Guided Scene Text Recognition
Yongkun Du
Z. Chen
Yuchen Su
Caiyan Jia
Yu-Gang Jiang
62
3
0
03 Jan 2025
Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning
Chun-Yi Kuan
Hung-yi Lee
AuLLM
LRM
56
1
0
03 Jan 2025
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
Gongyu Chen
Haomin Zhang
Chaofan Ding
Zihao Chen
Xinhan Di
30
0
0
23 Dec 2024
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing
Juncheng Hu
Guotao Liang
Jing Zhang
Dong Xu
Qian Yu
81
7
0
15 Dec 2024
MotionLLaMA: A Unified Framework for Motion Synthesis and Comprehension
Zeyu Ling
Bo Han
Shiyang Li
H. Shen
Jikang Cheng
Changqing Zou
79
1
0
26 Nov 2024
State-Space Large Audio Language Models
Saurabhchand Bhati
Yuan Gong
Leonid Karlinsky
Hilde Kuehne
Rogerio Feris
James Glass
87
0
0
24 Nov 2024
MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Satvik Dixit
Soham Deshmukh
Bhiksha Raj
27
1
0
01 Nov 2024
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S. Sakshi
Utkarsh Tyagi
Sonal Kumar
Ashish Seth
Ramaneswaran Selvakumar
Oriol Nieto
R. Duraiswami
Sreyan Ghosh
Dinesh Manocha
AuLLM
ELM
65
19
0
24 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
29
2
0
23 Oct 2024
Generative AI Agents in Autonomous Machines: A Safety Perspective
Jason J. Jabbour
Vijay Janapa Reddi
AI4CE
36
3
0
20 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLM
VLM
57
3
0
20 Oct 2024
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu
Yuhao Zhang
X. Wang
Benyou Wang
Q. Liu
H. Li
LM&MA
ELM
AuLLM
33
1
0
17 Oct 2024
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard
Michel Olvera
Stéphane Lathuilière
S. Essid
VLM
22
0
0
08 Oct 2024
MINER: Mining the Underlying Pattern of Modality-Specific Neurons in Multimodal Large Language Models
Kaichen Huang
Jiahao Huo
Yibo Yan
Kun Wang
Yutao Yue
Xuming Hu
28
2
0
07 Oct 2024
Distilling an End-to-End Voice Assistant Without Instruction Training Data
William B. Held
Ella Li
Michael Joseph Ryan
Weiyan Shi
Yanzhe Zhang
Diyi Yang
AuLLM
29
8
0
03 Oct 2024
PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif
M. Agro
Mohammad Areeb Qazi
Hanan Aldarmaki
VLM
16
1
0
29 Sep 2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Yiming Chen
Xianghu Yue
Xiaoxue Gao
Chen Zhang
L. F. D’Haro
R. Tan
Haizhou Li
AuLLM
30
0
0
27 Sep 2024
Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task
Jozef Coldenhoff
Milos Cernak
23
0
0
21 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
28
9
0
18 Sep 2024
Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition
Cagri Gungor
Adriana Kovashka
EgoV
19
0
0
15 Sep 2024
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Manjie Xu
Chenxing Li
Xinyi Tu
Yong Ren
Ruibo Fu
Wei Liang
Dong Yu
DiffM
35
1
0
14 Sep 2024
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh
Sonal Kumar
Chandra Kiran Reddy Evuru
Oriol Nieto
R. Duraiswami
Dinesh Manocha
VLM
22
0
0
13 Sep 2024
TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
Beilong Tang
Bang Zeng
Ming Li
23
2
0
12 Sep 2024
Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models
A. Sridhar
Yinyi Guo
Erik M. Visser
AuLLM
25
0
0
10 Sep 2024
MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
W. Zhang
Shuo Sun
Bin Wang
Xunlong Zou
Zhuohan Liu
Yingxu He
Geyu Lin
Nancy F. Chen
A. Aw
AuLLM
65
1
0
10 Sep 2024
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qingkai Fang
Shoutao Guo
Yan Zhou
Zhengrui Ma
Shaolei Zhang
Yang Feng
AuLLM
25
29
0
10 Sep 2024
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim
Minjeon Jeon
Jaeyoon Jung
Sang Hoon Woo
Jinjoo Lee
15
2
0
02 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
16
1
0
02 Sep 2024
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
Yunwen Xia
Hui Fang
Emmanouil Benetos
Jie Zhang
Chong Long
Dmitry Bogdanov
AuLLM
41
1
0
02 Aug 2024