Layer-Aware Video Composition via Split-then-Merge

25 November 2025

Ozgur Kara

Main: 8 pages · Bibliography: 4 pages · Appendix: 4 pages · 10 figures · 5 tables

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms state-of-the-art (SoTA) methods both on quantitative benchmarks and in human and VLLM-based qualitative evaluations. More details are available on our project page.
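As a rough illustration of the split-and-recompose idea described in the abstract, the sketch below separates a single grayscale frame into foreground and background layers using a binary mask, then alpha-composites the foreground onto a different background. It is a toy sketch under stated assumptions, not the paper's pipeline. The mask is assumed to be given. The function names `split_frame`, `merge_frame`, and `masked_l1` are hypothetical. The masked L1 error is only a simple stand-in for an identity-preservation term, whose actual form the abstract does not specify.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale H x W frame, values in [0, 1]
Mask = List[List[float]]   # 1.0 on the foreground subject, 0.0 elsewhere (assumed given)


def split_frame(frame: Frame, mask: Mask) -> Tuple[Frame, Frame]:
    """Split one frame into a foreground layer and a background layer."""
    fg = [[p * m for p, m in zip(row, mrow)] for row, mrow in zip(frame, mask)]
    bg = [[p * (1.0 - m) for p, m in zip(row, mrow)] for row, mrow in zip(frame, mask)]
    return fg, bg


def merge_frame(fg: Frame, new_bg: Frame, mask: Mask) -> Frame:
    """Alpha-composite a mask-premultiplied foreground onto a background."""
    return [
        [f + b * (1.0 - m) for f, b, m in zip(frow, brow, mrow)]
        for frow, brow, mrow in zip(fg, new_bg, mask)
    ]


def masked_l1(pred: Frame, target: Frame, mask: Mask) -> float:
    """Mean absolute error over foreground pixels only.

    Hypothetical stand-in for an identity-preservation term; the paper's
    actual loss is not given in the abstract.
    """
    num = sum(
        abs(p - t) * m
        for prow, trow, mrow in zip(pred, target, mask)
        for p, t, m in zip(prow, trow, mrow)
    )
    den = sum(m for mrow in mask for m in mrow)
    return num / den if den > 0 else 0.0


# Toy data: a 2x2 frame whose diagonal pixels belong to the subject.
frame = [[0.2, 0.8], [0.4, 0.6]]
mask = [[1.0, 0.0], [0.0, 1.0]]
other_bg = [[0.5, 0.5], [0.5, 0.5]]

fg, bg = split_frame(frame, mask)
recomposed = merge_frame(fg, other_bg, mask)
fg_error = masked_l1(recomposed, frame, mask)
```

In this toy setting the foreground pixels pass through compositing unchanged, so the masked error is zero. In a learned generative model the output generally differs from the input foreground, which is where a fidelity term of this kind would matter.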
