Instruction-based Image Manipulation by Watching How Things Move

Computer Vision and Pattern Recognition (CVPR), 2024

16 December 2024

Main: 8 Pages · 9 Figures · Bibliography: 2 Pages · 4 Tables

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics (such as non-rigid subject motion and complex camera movements) that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
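As a rough illustration of the frame-pair sampling step described in the abstract, the following minimal sketch draws (source, target) frame indices from a single video and builds a text prompt that could accompany the two frames when querying an MLLM. It is not the authors' implementation: the names `FramePair`, `sample_frame_pairs`, and `build_instruction_prompt`, the `min_gap` and `max_gap` parameters, and the prompt wording are all assumptions made for illustration, and the actual MLLM call is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class FramePair:
    source_idx: int  # frame treated as the image to be edited
    target_idx: int  # later frame treated as the edited result


def sample_frame_pairs(num_frames, num_pairs, min_gap=8, max_gap=64, seed=0):
    """Sample (source, target) frame indices from one video.

    min_gap and max_gap are hypothetical knobs, not values from the paper.
    They keep the two frames far enough apart to show visible change.
    """
    if num_frames <= min_gap:
        return []  # clip too short to yield a pair with the minimum gap
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Choose the temporal gap first, then a source frame that fits it.
        gap = rng.randint(min_gap, min(max_gap, num_frames - 1))
        src = rng.randint(0, num_frames - 1 - gap)
        pairs.append(FramePair(src, src + gap))
    return pairs


def build_instruction_prompt(pair):
    """Hypothetical prompt text sent to an MLLM together with both frames."""
    return (
        f"Image A is frame {pair.source_idx} and image B is frame "
        f"{pair.target_idx} of the same video. Write one editing instruction "
        "that would turn image A into image B."
    )
```

Under these assumptions, each training example would be a triple of source frame, generated instruction, and target frame, so the target frame serves as the ground-truth edit without any synthetic image generation.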
