302

I Want It That Way! Specifying Nuanced Camera Motions in Video Editing

Main:8 Pages
12 Figures
Bibliography:3 Pages
1 Tables
Abstract

Specifying nuanced and compelling camera motion remains a major hurdle for non-expert creators using generative tools, creating an ``expressive gap" where generic text prompts fail to capture cinematic vision. To address this, we present a novel zero-shot diffusion-based system that enables personalized camera motion transfer from a single reference video onto a user-provided static image. Our technical contribution introduces an intuitive interaction paradigm that bypasses the need for 3D data, predefined trajectories, or complex graphical interfaces. The core pipeline leverages a text-to-video diffusion model, employing a two-phase strategy: 1) a multi-concept learning method using LoRA layers and an orthogonality loss to distinctly capture spatial-temporal characteristics and scene features, and 2) a homography-based refinement strategy to enhance temporal and spatial alignment of the generated video. Extensive evaluation demonstrates the efficacy of our method. In a comparative study with 72 participants, our system was significantly preferred over prior work for both motion accuracy (90.45\%) and scene preservation (70.31\%). A second study confirmed our interface significantly improves usability and creative control for video direction. Our work contributes a robust technical solution and a novel human-centered design, significantly expanding cinematic video editing for diverse users.

View on arXiv
Comments on this paper