Ming-Omni: A Unified Multimodal Model for Perception and Generation

11 June 2025
Inclusion AI
Biao Gong
Cheng Zou
Chuanyang Zheng
Chunluan Zhou
Canxiang Yan
Chunxiang Jin
Chunjie Shen
Dandan Zheng
Fudong Wang
Furong Xu
Guangming Yao
Jun Zhou
Jingdong Chen
Jianxin Sun
Jiajia Liu
Jianjiang Zhu
Jun Peng
Kaixiang Ji
Kaiyou Song
Kaimeng Ren
Libin Wang
Lixiang Ru
Lele Xie
Longhua Tan
Lyuxin Xue
Lan Wang
Mochen Bai
Ning Gao
Pei Chen
Qingpei Guo
Qinglong Zhang
Qiang Xu
Rui Liu
Ruijie Xiong
Sirui Gao
Tinghao Liu
Taisong Li
Weilong Chai
Xinyu Xiao
Xiaomei Wang
Xiaoxue Chen
Xiao Lu
Xiaoyu Li
Xingning Dong
Xuzheng Yu
Yi Yuan
Yuting Gao
Yunxiao Sun
Yipeng Chen
Yifei Wu
Yongjie Lyu
Ziping Ma
Zipeng Feng
Zhijiang Fang
Zhihao Qiu
Ziyuan Huang
Zhengyu He
Topics: MLLM · AuLLM
ArXiv (abs) · PDF · HTML
Main: 19 pages · 7 figures · Bibliography: 7 pages · 8 tables · Appendix: 2 pages
Abstract

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
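
The abstract's central architectural idea, a shared mixture-of-experts (Ling) fed by modality-specific routers, can be illustrated with a short sketch. The snippet below is a hypothetical, minimal illustration and not the released Ming-Omni code: the class name, hidden sizes, top-k routing scheme, and the dense expert loop are all assumptions made for clarity.

# Minimal sketch (not the authors' implementation) of modality-specific
# routing into a shared pool of MoE experts. Each modality's tokens are
# gated by its own router, while the expert FFN weights are shared.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityRoutedMoE(nn.Module):
    def __init__(self, d_model=1024, n_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # One gating network (router) per modality.
        self.routers = nn.ModuleDict(
            {m: nn.Linear(d_model, n_experts) for m in modalities}
        )

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq, d_model) produced by that modality's encoder.
        logits = self.routers[modality](tokens)          # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # route to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Dense loop for readability; real MoE layers dispatch tokens sparsely.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens sent to expert e
                out = out + mask * weights[..., k:k + 1] * expert(tokens)
        return out

In this toy version, each token stream (e.g., image tokens from the vision encoder) passes through the same layer with its own modality tag, so routing decisions specialize per modality while the expert parameters stay shared, which is the property the abstract attributes to the modality-specific routers in Ling.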

@article{ai2025_2506.09344,
  title={Ming-Omni: A Unified Multimodal Model for Perception and Generation},
  author={Inclusion AI and Biao Gong and Cheng Zou and Chuanyang Zheng and Chunluan Zhou and Canxiang Yan and Chunxiang Jin and Chunjie Shen and Dandan Zheng and Fudong Wang and Furong Xu and GuangMing Yao and Jun Zhou and Jingdong Chen and Jianxin Sun and Jiajia Liu and Jianjiang Zhu and Jun Peng and Kaixiang Ji and Kaiyou Song and Kaimeng Ren and Libin Wang and Lixiang Ru and Lele Xie and Longhua Tan and Lyuxin Xue and Lan Wang and Mochen Bai and Ning Gao and Pei Chen and Qingpei Guo and Qinglong Zhang and Qiang Xu and Rui Liu and Ruijie Xiong and Sirui Gao and Tinghao Liu and Taisong Li and Weilong Chai and Xinyu Xiao and Xiaomei Wang and Xiaoxue Chen and Xiao Lu and Xiaoyu Li and Xingning Dong and Xuzheng Yu and Yi Yuan and Yuting Gao and Yunxiao Sun and Yipeng Chen and Yifei Wu and Yongjie Lyu and Ziping Ma and Zipeng Feng and Zhijiang Fang and Zhihao Qiu and Ziyuan Huang and Zhengyu He},
  journal={arXiv preprint arXiv:2506.09344},
  year={2025}
}