ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

1 February 2026

Ayushman Sarkar

Zhenyu Yu

Chu Chen

Wei Tang

Kangning Cui

Mohd Yamani Idna Idris

DiffM

VGen


Main: 5 pages, 5 figures, 3 tables

Bibliography: 1 page

Appendix: 1 page

Abstract

Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: this https URL
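
The decorrelation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the shared directions are estimated, so the sketch assumes a single shared direction taken as the normalized mean of the frame embeddings, and the names `suppress_shared_direction` and `strength` are hypothetical. The identity embedding is left untouched and only the frame-specific embeddings are modified.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def suppress_shared_direction(frames, strength=1.0):
    """Remove a fraction of the component that frame embeddings share.

    Assumption (not from the paper): the shared direction is the
    normalized mean of all frame embeddings. `strength` = 1.0 removes
    the projection entirely; 0.0 leaves the embeddings unchanged.
    """
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    if norm < 1e-12:
        # No common direction to suppress; return copies unchanged.
        return [list(f) for f in frames]
    unit = [m / norm for m in mean]

    decorrelated = []
    for f in frames:
        # Projection coefficient of this frame onto the shared direction.
        coeff = strength * dot(f, unit)
        decorrelated.append([x - coeff * u for x, u in zip(f, unit)])
    return decorrelated


# Toy example: three 3-dimensional "frame embeddings" that all share
# a component along the first axis.
frames = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, -1.0, -1.0]]
cleaned = suppress_shared_direction(frames)
```

In this toy case the mean points along the first axis, so the first coordinate of every frame is removed while the frame-specific coordinates are kept. In a real pipeline the vectors would be per-token text-encoder embeddings rather than three-element lists.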
