2503.20215
Cited By
Qwen2.5-Omni Technical Report
26 March 2025
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
S. Bai
Keqin Chen
Jialin Wang
Yang Fan
K. Dang
Bin Zhang
Xinyu Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
HuggingFace (164 upvotes)
Papers citing "Qwen2.5-Omni Technical Report"
50 / 242 papers shown
Kwai Keye-VL 1.5 Technical Report
Biao Yang
Bin Wen
Boyang Ding
Changyi Liu
Chenglong Chu
...
S. Wang
X. Luo
Yan Li
Yuhang Hu
Zixing Zhang
VLM
325
15
0
01 Sep 2025
WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations
J. Kim
Heeseung Yun
Sang Hoon Woo
Chao-Han Huck Yang
Gunhee Kim
AuLLM
114
0
0
28 Aug 2025
ChipChat: Low-Latency Cascaded Conversational Agent in MLX
Tatiana Likhomanenko
Luke Carlson
Richard He Bai
Zijin Gu
Han Tran
Zakaria Aldeneh
Yizhe Zhang
Ruixiang Zhang
Huangjie Zheng
Navdeep Jaitly
105
1
0
26 Aug 2025
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Zhenwei Tang
Difan Jiao
Blair Yang
Ashton Anderson
VLM
CoGe
142
1
0
25 Aug 2025
Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Fatemeh Taherinezhad
Mohamad Javad Momeni Nezhad
Sepehr Karimi
Sina Rashidi
Ali Zolnour
Maryam Dadkhah
Yasaman Haghbin
Hossein Azadmaleki
Maryam Zolnoori
90
1
0
24 Aug 2025
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
Yuancheng Wang
Dekun Chen
Xueyao Zhang
Junan Zhang
Jiaqi Li
Zhizheng Wu
228
4
0
22 Aug 2025
Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Zhifei Xie
Ziyang Ma
Zihang Liu
Kaiyu Pang
Hongyu Li
J. Zhang
Yue Liao
Deheng Ye
Chunyan Miao
Shuicheng Yan
AuLLM
LRM
264
7
0
18 Aug 2025
RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts
Xuming He
Zhiyuan You
Junchao Gong
Couhua Liu
Xiaoyu Yue
Peiqin Zhuang
Wenlong Zhang
92
3
0
17 Aug 2025
Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
Zhifeng Kong
Arushi Goel
J. F. Santos
Sreyan Ghosh
Rafael Valle
Wei Ping
Bryan Catanzaro
ReLM
AuLLM
LRM
178
2
0
15 Aug 2025
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long
Yexiao He
Wentao Ye
Yiyuan Pan
Yuan Lin
Hang Li
Junbo Zhao
Wei Li
346
8
0
13 Aug 2025
MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Fan Zhang
Minghan Li
Chong Deng
Xue Yang
Zheng Lian
...
Xian Wu
Kun Wang
Xiangang Li
Jieping Ye
Pheng-Ann Heng
AI4MH
153
3
0
11 Aug 2025
Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models
Leyi Pan
Zheyu Fu
Yunpeng Zhai
Shuchang Tao
Sheng Guan
...
Zhaoyang Liu
Bolin Ding
Felix Henry
Lijie Wen
Aiwei Liu
MLLM
ELM
197
1
0
10 Aug 2025
LLMCARE: early detection of cognitive impairment via transformer models enhanced by LLM-generated synthetic data
Frontiers in Artificial Intelligence (Front. Artif. Intell.), 2025
Ali Zolnour
Hossein Azadmaleki
Yasaman Haghbin
Fatemeh Taherinezhad
Mohamad Javad Momeni Nezhad
...
Suzanne Bakken
Yadollah Yaghoobzadeh
Abdol-Hossein Vahabie
Masoud Rouhizadeh
Maryam Zolnoori
LM&MA
143
0
0
08 Aug 2025
Training-Free Multimodal Large Language Model Orchestration
Tianyu Xie
Yuhang Wu
Yongdong Luo
Jinfa Huang
Xiawu Zheng
137
0
0
06 Aug 2025
OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing
Fuqing Bie
Shiyu Huang
Xijia Tao
Zhiqin Fang
Leyi Pan
Junzhe Chen
Min Ren
Liuyu Xiang
Zhaofeng He
189
0
0
06 Aug 2025
RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis
Enzhi Wang
Qicheng Li
Shiwan Zhao
Aobo Kong
Jiaming Zhou
X. Yang
Yequan Wang
Yonghua Lin
Yong Qin
71
0
0
06 Aug 2025
ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan
Han Yin
Yang Xiao
Rohan Kumar Das
Jisheng Bai
Ting Dang
119
5
0
06 Aug 2025
MiDashengLM: Efficient Audio Understanding with General Audio Captions
Heinrich Dinkel
Gang Li
Jizhong Liu
Jian Luan
Yadong Niu
Xingwei Sun
Tianzi Wang
Qiyang Xiao
Junbo Zhang
Jiahao Zhou
AuLLM
AI4TS
VLM
422
13
0
06 Aug 2025
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Yogesh Kulkarni
Pooyan Fazli
OffRL
LRM
280
4
0
05 Aug 2025
SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
C. Jiang
Jiajun Sun
Yifei Cao
Jiabao Zhuang
Hui Li
Xiaoran Fan
Ming-bo Wen
Junjie Ye
Jiajun Sun
299
0
0
04 Aug 2025
VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo
Qianli Ma
Yaowei Zheng
Zhelun Shi
Zhongkai Zhao
Bin Jia
...
Y. Li
Jiacheng Yang
Yanghua Peng
Zhi-Li Zhang
Xin Liu
MoE
VLM
349
3
0
04 Aug 2025
Multimodal Large Language Models for End-to-End Affective Computing: Benchmarking and Boosting with Generative Knowledge Prompting
Miaosen Luo
Jiesen Long
Zequn Li
Yunying Yang
Yuncheng Jiang
Sijie Mai
198
1
0
04 Aug 2025
From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
Yuhang Jia
Xu Zhang
Yong Qin
Yang Chen
Shiwan Zhao
VLM
203
0
0
03 Aug 2025
Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings
Alexia Jolicoeur-Martineau
VGen
116
0
0
01 Aug 2025
AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation
L. Wang
Jun Wang
Feng Deng
Chen Zhang
Di Zhang
Kun Gai
DiffM
VGen
746
7
0
01 Aug 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Yuying Ge
Yixiao Ge
Chen Li
Teng Wang
Junfu Pu
...
Xiaojing Zhang
Yangyu Tao
Han Hu
Di Wang
Mingyu Ding
151
13
0
28 Jul 2025
JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
Xinhan Di
Kristin Qi
Pengqian Yu
DiffM
VGen
214
0
0
28 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
508
11
0
27 Jul 2025
Predicting Brain Responses To Natural Movies With Multimodal LLMs
Cesar Kadir Torrico Villanueva
Jiaxin Cindy Tu
Mihir Tripathy
Connor Lane
Rishab Iyer
Paul S. Scotti
128
3
0
26 Jul 2025
DIFFA: Large Language Diffusion Models Can Listen and Understand
Jiaming Zhou
Hongjie Chen
Shiwan Zhao
Jian Kang
Jie Li
...
Haoqin Sun
Hui Wang
Aobo Kong
Yong Qin
X. Li
208
3
0
24 Jul 2025
GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness
Hongjie Chen
Zehan Li
Yaodong Song
Wenming Deng
Yitong Yao
...
Chao Wang
Shuangyong Song
Yongxiang Li
Zhongjiang He
Xuelong Li
AuLLM
VLM
255
3
0
24 Jul 2025
VIBE: Video-Input Brain Encoder for fMRI Response Modeling
Daniel Carlstrom Schad
Shrey Dixit
Janis Keck
Viktor Studenyak
Aleksandr Shpilevoi
Andrej Bicanski
240
2
0
23 Jul 2025
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Cheng-Han Chiang
Xiaofei Wang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
S. Liu
Zhendong Wang
Zhengyuan Yang
Hung-yi Lee
Lijuan Wang
ReLM
LRM
140
10
0
21 Jul 2025
Pixels, Patterns, but No Poetry: To See The World like Humans
Hongcheng Gao
Longxiang Zhang
Lin Xu
Jingyi Tang
X. Li
...
Xinlong Yang
Ge Wu
Balong Bi
Hongyu Chen
Wentao Zhang
MLLM
LRM
VLM
158
3
0
21 Jul 2025
BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM
Haiquan Wen
Tianxiao Li
Zhenglin Huang
Yiwei He
Guangliang Cheng
301
2
0
19 Jul 2025
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren
Zhiqiang Lin
Yu Li
Gao Meng
Weiyun Wang
...
Zicheng Lin
Jifeng Dai
Yujiu Yang
Wenhai Wang
Ruihang Chu
176
3
0
17 Jul 2025
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Peiran Wu
Yunze Liu
Zhengdong Zhu
Enmin Zhou
Junxiao Shen
209
2
0
15 Jul 2025
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Hang Shao
Heting Gao
Yunhang Shen
Jiawei Chen
Zuwei Long
Dong Yang
Ke Li
Xing Sun
AuLLM
MoE
218
2
0
27 Jun 2025
WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
Jian Zhang
Linhao Zhang
Bokai Lei
Chuhan Wu
Aiwei Liu
Wei Jia
Xiao-bin Zhou
AuLLM
LM&MA
243
2
0
27 Jun 2025
RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Yeongtak Oh
J. Mok
Juhyeon Shin
Sangha Park
J. Mok
Sungroh Yoon
VLM
388
1
0
23 Jun 2025
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
Changli Tang
Yixuan Li
Yudong Yang
Jimin Zhuang
Guangzhi Sun
Wei Li
Zejun Ma
Chao Zhang
377
2
0
18 Jun 2025
AviationLLM: An LLM-based Knowledge System for Aviation Training
Jia'ang Wan
Feng Shen
Fujuan Li
Yanjin Sun
Yan Li
Shiwen Zhang
204
1
0
17 Jun 2025
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models
Xingjian Diao
Chunhui Zhang
Keyi Kong
Weiyi Wu
Chiyu Ma
Z. Ouyang
Peijun Qing
Soroush Vosoughi
Jiang Gui
AuLLM
OffRL
ReLM
LRM
211
8
0
15 Jun 2025
NoLoCo: No-all-reduce Low Communication Training Method for Large Models
Jari Kolehmainen
Nikolay Blagoev
John Donaghy
Oğuzhan Ersoy
Christopher Nies
278
0
0
12 Jun 2025
VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Hyeongcheol Park
Jiyoung Seo
MinHyuk Jang
Hogun Park
Ha Dam Baek
Gyusam Chang
Hyeonsoo Im
Sangpil Kim
305
2
0
11 Jun 2025
Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model
Ailin Huang
B. Li
Bruce Wang
Boyong Wu
Chao Yan
...
X. Zhang
Yibo Zhu
Daxin Jiang
Shuchang Zhou
Chen-Hao Hu
AuLLM
345
7
0
10 Jun 2025
UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Wenkang Han
Zhixiong Zeng
Jing Huang
Shu Jiang
Liming Zheng
Longrong Yang
Haibo Qiu
Chang Yao
Jingyuan Chen
Lin Ma
LM&Ro
266
2
0
10 Jun 2025
DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech
Haotian Guo
Jing Han
Yongfeng Tu
Shihao Gao
Shengfan Shen
Wulong Xiang
Weihao Gan
Zixing Zhang
137
0
0
09 Jun 2025
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Emmanouil Zaranis
António Farinhas
Saul Santos
Beatriz Canaverde
Miguel Moura Ramos
...
Raffaella Bernardi
Raquel Fernández
Sandro Pezzelle
Vlad Niculae
Andre F. T. Martins
231
3
0
06 Jun 2025
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Lidong Lu
Guo Chen
Ruoyao Xiao
Yicheng Liu
Tong Lu
VLM
LRM
339
7
0
05 Jun 2025