arXiv: 2503.20215

Qwen2.5-Omni Technical Report

26 March 2025

ArXiv (abs) | PDF | HTML | HuggingFace (164 upvotes)

Papers citing "Qwen2.5-Omni Technical Report"

50 / 242 papers shown

Kwai Keye-VL 1.5 Technical Report

...

325

15

0

01 Sep 2025

WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations

Chao-Han Huck Yang

114

0

0

28 Aug 2025

ChipChat: Low-Latency Cascaded Conversational Agent in MLX

Tatiana Likhomanenko

Zakaria Aldeneh

105

1

0

26 Aug 2025

SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Ashton Anderson

142

1

0

25 Aug 2025

Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Fatemeh Taherinezhad

Mohamad Javad Momeni Nezhad

Yasaman Haghbin

Hossein Azadmaleki

Maryam Zolnoori

90

1

0

24 Aug 2025

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

228

4

0

22 Aug 2025

Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

264

7

0

18 Aug 2025

RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts

92

3

0

17 Aug 2025

Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding

Bryan Catanzaro

178

2

0

15 Aug 2025

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

346

8

0

13 Aug 2025

MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

...

153

3

0

11 Aug 2025

Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

...

197

1

0

10 Aug 2025

LLMCARE: early detection of cognitive impairment via transformer models enhanced by LLM-generated synthetic data

Frontiers in Artificial Intelligence (Front. Artif. Intell.), 2025

Hossein Azadmaleki

Yasaman Haghbin

Fatemeh Taherinezhad

Mohamad Javad Momeni Nezhad

...

Yadollah Yaghoobzadeh

Abdol-Hossein Vahabie

Masoud Rouhizadeh

Maryam Zolnoori

143

0

0

08 Aug 2025

Training-Free Multimodal Large Language Model Orchestration

137

0

0

06 Aug 2025

OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

189

0

0

06 Aug 2025

RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

71

0

0

06 Aug 2025

ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan

Rohan Kumar Das

119

5

0

06 Aug 2025

MiDashengLM: Efficient Audio Understanding with General Audio Captions

Heinrich Dinkel

AuLLM AI4TS VLM

422

13

0

06 Aug 2025

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni

280

4

0

05 Aug 2025

SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

299

0

0

04 Aug 2025

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

...

349

3

0

04 Aug 2025

Multimodal Large Language Models for End-to-End Affective Computing: Benchmarking and Boosting with Generative Knowledge Prompting

198

1

0

04 Aug 2025

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs

203

0

0

03 Aug 2025

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Alexia Jolicoeur-Martineau

116

0

0

01 Aug 2025

AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

746

7

0

01 Aug 2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

...

151

13

0

28 Jul 2025

JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

214

0

0

28 Jul 2025

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

508

11

0

27 Jul 2025

Predicting Brain Responses To Natural Movies With Multimodal LLMs

Cesar Kadir Torrico Villanueva

Jiaxin Cindy Tu

128

3

0

26 Jul 2025

DIFFA: Large Language Diffusion Models Can Listen and Understand

...

208

3

0

24 Jul 2025

GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

...

Shuangyong Song

255

3

0

24 Jul 2025

VIBE: Video-Input Brain Encoder for fMRI Response Modeling

Daniel Carlstrom Schad

Viktor Studenyak

Aleksandr Shpilevoi

Andrej Bicanski

240

2

0

23 Jul 2025

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang

Chung-Ching Lin

Kevin Qinghong Lin

140

10

0

21 Jul 2025

Pixels, Patterns, but No Poetry: To See The World like Humans

Longxiang Zhang

...

158

3

0

21 Jul 2025

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Guangliang Cheng

301

2

0

19 Jul 2025

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

...

176

3

0

17 Jul 2025

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

209

2

0

15 Jul 2025

DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

218

2

0

27 Jun 2025

WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild

243

2

0

27 Jun 2025

RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

388

1

0

23 Jun 2025

video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

377

2

0

18 Jun 2025

AviationLLM: An LLM-based Knowledge System for Aviation Training

204

1

0

17 Jun 2025

SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models

Soroush Vosoughi

AuLLM OffRL ReLM LRM

211

8

0

15 Jun 2025

NoLoCo: No-all-reduce Low Communication Training Method for Large Models

Jari Kolehmainen

Nikolay Blagoev

Christopher Nies

278

0

0

12 Jun 2025

VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Hyeongcheol Park

305

2

0

11 Jun 2025

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

...

345

7

0

10 Jun 2025

UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

266

2

0

10 Jun 2025

DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech

137

0

0

09 Jun 2025

Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

Emmanouil Zaranis

António Farinhas

Beatriz Canaverde

Miguel Moura Ramos

...

Raffaella Bernardi

Raquel Fernández

Sandro Pezzelle

Andre F. T. Martins

231

3

0

06 Jun 2025

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

339

7

0

05 Jun 2025