Qwen2.5-Omni Technical Report

26 March 2025

Papers citing "Qwen2.5-Omni Technical Report"

42 / 242 papers shown

How Far Are We from Generating Missing Modalities with Foundation Models?

304

04 Jun 2025

Is Extending Modality The Right Path Towards Omni-Modality?

281

02 Jun 2025

CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning

272

31 May 2025

ACE-Step: A Step Towards Music Generation Foundation Model

231

28 May 2025

OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

279

28 May 2025

DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

453

26 May 2025

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Chun-Yi Kuan

Hung-yi Lee

AuLLM

303

26 May 2025

SpeakStream: Streaming Text-to-Speech with Interleaved Data

167

25 May 2025

POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval

329

25 May 2025

Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators

568

24 May 2025

Multimodal Conversation Structure Understanding

323

23 May 2025

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

207

23 May 2025

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

...

342

23 May 2025

RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

...

376

22 May 2025

VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Qunshan Gu

Yanfeng Wang

Yu Wang

AuLLM

274

21 May 2025

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

...

306

19 May 2025

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

232

17 May 2025

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

334

14 May 2025

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

...

410

14 May 2025

Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation

279

11 May 2025

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

348

07 May 2025

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

...

247

06 May 2025

Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models

551

30 Apr 2025

Kimi-Audio Technical Report

...

427

122

25 Apr 2025

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

...

283

24 Apr 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

...

356

22 Apr 2025

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

355

22 Apr 2025

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

973

05 Apr 2025

Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

379

27 Feb 2025

Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

...

443

21 Feb 2025

Qwen2.5-VL Technical Report

...

720

2,913

20 Feb 2025

Baichuan-Omni-1.5 Technical Report

Tao Zhang

...

328

28 Jan 2025

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

...

358

31 Dec 2024

VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Yogesh Kulkarni

Pooyan Fazli

VLM

606

01 Dec 2024

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

615

277

09 Oct 2024

Recent Advances in Speech Language Models: A Survey

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

544

01 Oct 2024

MIO: A Foundation Model on Multimodal Tokens

...

463

26 Sep 2024

OmniBench: Towards The Future of Universal Omni-Language Models

...

609

23 Sep 2024

Benchmarking Sub-Genre Classification For Mainstage Dance Music

156

10 Sep 2024

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

International Conference on Learning Representations (ICLR), 2024

Yi-Fan Zhang

Huanyu Zhang

Haochen Tian

Chaoyou Fu

Shuangqing Zhang

...

Qingsong Wen

Zhang Zhang

Liwen Wang

Rong Jin

Tieniu Tan

OffRL

363

136

23 Aug 2024

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

...

727

359

16 Jul 2024

SpeechVerse: A Large-scale Generalizable Audio Language Model

...

474

14 May 2024