GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

13 March 2025
Rongyao Fang
Chengqi Duan
Kun Wang
Linjiang Huang
Hao Li
Shilin Yan
Hao Tian
Xingyu Zeng
Rui Zhao
Jifeng Dai
Xihui Liu
Hongsheng Li
Tags: MLLM, ReLM, LRM
Abstract

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition or explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.
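The abstract outlines a two-stage architecture: an MLLM (Qwen2.5-VL) first emits an explicit semantic-spatial reasoning chain, which then conditions a diffusion model through the Semantic-Spatial Guidance Module. Below is a minimal sketch of that control flow only; the names GoTStep, generate_reasoning_chain, and generate_image are illustrative assumptions, not the released API, and both stages are stubbed out rather than calling real models.

```python
# Hypothetical sketch of the two-stage GoT pipeline described in the abstract.
# Interfaces and names are assumptions for illustration, not the paper's code.
from dataclasses import dataclass

@dataclass
class GoTStep:
    """One step of the reasoning chain: an entity plus its spatial placement."""
    description: str  # semantic content, e.g. "a red apple"
    bbox: tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)

def generate_reasoning_chain(prompt: str) -> list[GoTStep]:
    """Stage 1 (assumed interface): an MLLM such as Qwen2.5-VL expands the
    prompt into an explicit chain of semantic-spatial reasoning steps."""
    # Stub: a real system would query the MLLM here.
    return [
        GoTStep("a wooden table", (0.0, 0.5, 1.0, 1.0)),
        GoTStep("a red apple on the table", (0.4, 0.35, 0.6, 0.55)),
    ]

def generate_image(prompt: str, chain: list[GoTStep]) -> str:
    """Stage 2 (assumed interface): a diffusion model conditioned on the
    chain via a semantic-spatial guidance signal."""
    # Stub: a real system would run guided diffusion; here we just summarize.
    layout = "; ".join(f"{s.description} @ {s.bbox}" for s in chain)
    return f"<image generated for '{prompt}' with layout: {layout}>"

if __name__ == "__main__":
    prompt = "an apple on a table"
    chain = generate_reasoning_chain(prompt)
    # Interactive editing: because the chain is explicit data, a user can
    # modify a step (here, moving the apple's box) before re-generating.
    chain[1].bbox = (0.1, 0.35, 0.3, 0.55)
    print(generate_image(prompt, chain))
```

The last lines mirror the abstract's interactive-generation claim: since the reasoning chain is explicit, a user can edit one step and re-run only the diffusion stage for a precise adjustment.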

@article{fang2025_2503.10639,
  title={GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing},
  author={Rongyao Fang and Chengqi Duan and Kun Wang and Linjiang Huang and Hao Li and Shilin Yan and Hao Tian and Xingyu Zeng and Rui Zhao and Jifeng Dai and Xihui Liu and Hongsheng Li},
  journal={arXiv preprint arXiv:2503.10639},
  year={2025}
}