Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

30 August 2023

Jie Li

Fan Yang

Robert C. Qiu

Abstract

Denoising diffusion models are increasingly popular for image editing. However, current approaches typically rely either on image-guided methods, which provide a visual reference but lack control over semantic consistency, or on text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this trade-off, we propose a framework that fuses generated visual references with text guidance in the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, the framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, it generates higher-quality images while producing realistic editing effects across various benchmark datasets.
