
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun
Jun Xie
Tao Lin
Main: 8 pages · Appendix: 8 pages · Bibliography: 4 pages · 7 figures · 10 tables
Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, improving instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours, with the vast majority (1000 hours) dedicated to the efficient image-only pre-training stage. It achieves 0.89 on GenEval and 0.55 on WISE, matching or surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at this https URL.
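To make the two-stage recipe concrete, below is a minimal PyTorch sketch of the training procedure the abstract describes: stage 1 applies a masked-modeling objective to unlabeled image tokens only, and stage 2 mixes that objective with batches conditioned on text. Everything here is a hypothetical placeholder, not the paper's implementation: the `VisualGenerator` module, the mask ratio of 0.6, the pairing probability of 0.3, and the use of random tensors in place of real patch and text embeddings are all assumptions for illustration.

```python
# Hypothetical sketch of the IOMM two-stage recipe; names, losses, and
# hyperparameters are placeholders, not the paper's actual implementation.
import random
import torch
import torch.nn as nn

class VisualGenerator(nn.Module):
    """Stand-in for the UMM's visual generation component."""
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens, mask):
        # Replace masked positions with a learned mask token, then
        # predict the original tokens (masked-modeling objective).
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.head(self.encoder(x))

def masked_modeling_loss(model, tokens, mask_ratio=0.6):
    # Reconstruction loss computed only on the masked positions.
    mask = torch.rand(tokens.shape[:2]) < mask_ratio
    pred = model(tokens, mask)
    return ((pred - tokens) ** 2)[mask].mean()

model = VisualGenerator()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: image-only pre-training on abundant unlabeled images.
for _ in range(100):  # toy loop; the real stage runs far longer
    patches = torch.randn(8, 64, 256)  # placeholder image patch embeddings
    loss = masked_modeling_loss(model, patches)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tuning on a mixture of unlabeled images and a small
# curated set of text-image pairs.
for _ in range(20):
    if random.random() < 0.3:  # placeholder mixing probability
        text = torch.randn(8, 1, 256)      # placeholder text conditioning
        patches = torch.randn(8, 64, 256)
        tokens = torch.cat([text, patches], dim=1)
        mask = torch.zeros(tokens.shape[:2], dtype=torch.bool)
        mask[:, 1:] = torch.rand(8, 64) < 0.6  # never mask the text token
        pred = model(tokens, mask)
        loss = ((pred - tokens) ** 2)[mask].mean()
    else:
        loss = masked_modeling_loss(model, torch.randn(8, 64, 256))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is the data flow, not the architecture: stage 1 never touches text, which is why the expensive bulk of training (1000 of the ~1050 GPU hours reported above) can run on unlabeled images alone, while stage 2 only needs enough paired data to align generation with instructions.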
