
LayerCraft: Enhancing Text-to-Image Generation with CoT Reasoning and Layered Object Integration

Main: 8 pages · Appendix: 16 pages · Bibliography: 4 pages · 16 figures · 5 tables
Abstract

Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present LayerCraft, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) structured generation from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) layered object integration, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the ChainArchitect for CoT-driven layout planning, and the Object Integration Network (OIN) for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at this https URL.
