Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Abstract

We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, the project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing novel multi-scale learnable tokens and a multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones, such as ChatGPT-4o with native image generation, released on March 25, 2025, underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is currently in its alpha stage and will be further refined.
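To make the high-level design concrete, the sketch below illustrates one way the abstract's idea could be wired together: a frozen MLLM contextualizes banks of multi-scale learnable query tokens, and only a small connector (plus a downstream diffusion head, not shown) is trained, together with an alignment term that ties the scales together. This is a minimal illustration under our own assumptions, not the authors' implementation; the class names, token counts, and the cosine-based alignment loss are all hypothetical.

```python
# Minimal sketch (assumptions, not the released code): frozen MLLM + learnable
# multi-scale query tokens whose outputs would condition a trainable diffusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleQueryConditioner(nn.Module):
    """Learnable query tokens at several scales, contextualized by a frozen MLLM."""

    def __init__(self, mllm: nn.Module, hidden: int = 1024, scales=(4, 16, 64)):
        super().__init__()
        self.mllm = mllm.eval()                      # understanding model stays frozen
        for p in self.mllm.parameters():
            p.requires_grad_(False)
        # One bank of learnable tokens per scale (coarse -> fine); sizes are illustrative.
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(n, hidden) * 0.02) for n in scales]
        )
        self.proj = nn.Linear(hidden, hidden)        # trainable connector to the image head

    def forward(self, text_embeds: torch.Tensor) -> list[torch.Tensor]:
        # text_embeds: (B, T, hidden) prompt token embeddings.
        B = text_embeds.size(0)
        conds = []
        for q in self.queries:
            seq = torch.cat([text_embeds, q.unsqueeze(0).expand(B, -1, -1)], dim=1)
            with torch.no_grad():
                ctx = self.mllm(seq)                 # (B, T + Nq, hidden)
            conds.append(self.proj(ctx[:, -q.size(0):]))  # keep only the query outputs
        return conds


def representation_alignment_loss(conds: list[torch.Tensor]) -> torch.Tensor:
    """Assumed stand-in for multi-scale alignment: pull pooled per-scale features together."""
    pooled = [F.normalize(c.mean(dim=1), dim=-1) for c in conds]
    loss = sum(1 - F.cosine_similarity(pooled[i], pooled[i + 1], dim=-1).mean()
               for i in range(len(pooled) - 1))
    return loss / max(len(pooled) - 1, 1)


if __name__ == "__main__":
    # Stand-in for the frozen MLLM backbone; any sequence model with matching width works.
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
        num_layers=2,
    )
    conditioner = MultiScaleQueryConditioner(backbone)
    prompt = torch.randn(2, 12, 1024)                # fake prompt embeddings
    conds = conditioner(prompt)                      # per-scale conditions for a diffusion head
    print([c.shape for c in conds], representation_alignment_loss(conds).item())
```

In this reading, only the query banks, the connector, and the diffusion head would receive gradients, which matches the abstract's framing of a fixed MLLM paired with a learnable generator.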

@article{ai2025_2505.02471,
  title={Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction},
  author={Inclusion AI and Biao Gong and Cheng Zou and Dandan Zheng and Hu Yu and Jingdong Chen and Jianxin Sun and Junbo Zhao and Jun Zhou and Kaixiang Ji and Lixiang Ru and Libin Wang and Qingpei Guo and Rui Liu and Weilong Chai and Xinyu Xiao and Ziyuan Huang},
  journal={arXiv preprint arXiv:2505.02471},
  year={2025}
}