Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models

We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior visual quality, especially under large-scale and complex transformation scenarios.
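The layered editing strategy described above can be sketched in code: lift the text layer with a mask, reconstruct the background behind it with a pretrained inpainting model, apply a geometric transformation to the text layer, and composite the result. The sketch below is a minimal illustration of that general pipeline, not the DanceText implementation; it assumes a user-supplied text mask, uses OpenCV for the affine warp, takes `stabilityai/stable-diffusion-2-inpainting` as an example off-the-shelf diffusers checkpoint, and omits the depth-aware alignment module.

```python
# Illustrative sketch (not the authors' code) of layered, training-free text editing:
# 1) extract the text layer via a mask, 2) inpaint the background with a pretrained
# diffusion model, 3) rotate/translate/scale the text layer, 4) composite it back.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline  # pretrained, no fine-tuning


def transform_layer(layer, mask, angle_deg, tx, ty, scale):
    """Apply one affine transform (rotation + scale + translation) to layer and mask."""
    h, w = mask.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, scale)
    M[:, 2] += (tx, ty)  # add translation to the affine matrix
    warped_layer = cv2.warpAffine(layer, M, (w, h), flags=cv2.INTER_LINEAR)
    warped_mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return warped_layer, warped_mask


def edit_text_layer(image_path, mask_path, prompt="clean background",
                    angle_deg=15.0, tx=40, ty=-20, scale=1.2):
    image = np.array(Image.open(image_path).convert("RGB"))
    mask = np.array(Image.open(mask_path).convert("L"))  # 255 where text pixels are

    # Step 1: separate the text layer from the background.
    text_layer = cv2.bitwise_and(image, image, mask=mask)

    # Step 2: reconstruct the background behind the original text
    # with an off-the-shelf diffusion inpainting pipeline (training-free).
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    bg = pipe(prompt=prompt,
              image=Image.fromarray(image).resize((512, 512)),
              mask_image=Image.fromarray(mask).resize((512, 512))).images[0]
    bg = np.array(bg.resize((image.shape[1], image.shape[0])))

    # Step 3: geometric transformation of the text layer (and its mask).
    warped_text, warped_mask = transform_layer(text_layer, mask,
                                               angle_deg, tx, ty, scale)

    # Step 4: composite the transformed text onto the reconstructed background.
    alpha = (warped_mask.astype(np.float32) / 255.0)[..., None]
    result = (alpha * warped_text + (1.0 - alpha) * bg).astype(np.uint8)
    return Image.fromarray(result)
```

In the full method, a depth-aware module would additionally adjust the appearance and perspective of the warped text before the final composite; the simple alpha blend here stands in for that step.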
@article{yu2025_2504.14108,
  title={Point-Driven Interactive Text and Image Layer Editing Using Diffusion Models},
  author={Zhenyu Yu and Mohd Yamani Idna Idris and Pei Wang and Yuelong Xia},
  journal={arXiv preprint arXiv:2504.14108},
  year={2025}
}