Show, Don't Tell: Morphing Latent Reasoning into Image Generation

Harold Haodong Chen
Xinxiang Yin
Wen-Jie Shu
Hongfei Zhang
Zixin Zhang
Chenfei Liao
Litao Guo
Qifeng Chen
Ying-Cong Chen
Main: 8 pages · 9 figures · 8 tables · Bibliography: 3 pages · Appendix: 9 pages
Abstract

Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by 15% and 11% on abstract reasoning tasks such as WISE and IPV-Txt; (III) reduces inference time by 44% and token consumption by 51%; and (IV) exhibits 71% cognitive alignment with human intuition on reasoning invocation.
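To make the four-component design concrete, below is a minimal PyTorch sketch of how a condenser, translator, shaper, and invoker could interact within one autoregressive generation step. All module names, shapes, and the interface to the base T2I model (e.g., how its logits and hidden states are exposed) are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Condenser(nn.Module):
    """Summarizes intermediate image-token states into a compact visual memory (hypothetical design)."""
    def __init__(self, d_model: int, n_memory: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_memory, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, d_model) hidden states of tokens generated so far
        q = self.queries.unsqueeze(0).expand(token_states.size(0), -1, -1)
        memory, _ = self.attn(q, token_states, token_states)
        return memory  # (batch, n_memory, d_model)


class Translator(nn.Module):
    """Converts latent thoughts (visual memory) into an actionable guidance vector (hypothetical design)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        return self.mlp(memory).mean(dim=1)  # (batch, d_model), pooled guidance


class Shaper(nn.Module):
    """Steers next image-token predictions by biasing the base model's logits (hypothetical design)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, logits: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        return logits + self.proj(guidance)  # additive correction to the base logits


class Invoker(nn.Module):
    """Decides whether to invoke latent reasoning at the current step (RL-trained in the paper; untrained here)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, last_state: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.score(last_state)).squeeze(-1)  # invocation probability per sample


def generate_step(base_logits, token_states, condenser, translator, shaper, invoker, threshold=0.5):
    """One autoregressive step: optionally refine the base model's next-token logits via latent reasoning."""
    p_invoke = invoker(token_states[:, -1])          # decide from the latest hidden state
    if (p_invoke > threshold).any():
        memory = condenser(token_states)             # compress the partial generation into visual memory
        guidance = translator(memory)                # latent thought -> guidance vector
        base_logits = shaper(base_logits, guidance)  # steer the next image-token prediction
    return base_logits
```

The key point the sketch illustrates is that every step stays in continuous latent space: the visual memory and guidance vectors are never decoded into text or pixels, which is how the framework avoids the decode/re-encode overhead of explicit-reasoning paradigms.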
