
Phantom: Subject-consistent video generation via cross-modal alignment

Abstract

Foundation models for video generation are rapidly evolving into diverse applications, yet subject-consistent video generation remains in an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent videos that follow textual instructions. We believe the essence of Subject-to-Video lies in balancing the dual-modal prompts of text and image, thereby aligning textual and visual content deeply and simultaneously. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and train it to learn cross-modal alignment from text-image-video triplet data. The proposed method achieves high-fidelity subject-consistent video generation while mitigating image-content leakage and multi-subject confusion. Evaluation results indicate that our method outperforms state-of-the-art closed-source commercial solutions. In particular, we emphasize subject consistency in human generation, subsuming existing ID-preserving video generation while offering further advantages.
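The abstract describes a joint text-image injection mechanism that conditions the video denoiser on both prompt modalities at once. As a rough illustration of that idea only, the sketch below concatenates reference-image tokens with text tokens into a single cross-attention context that the video latents attend to; the module names, dimensions, and single-block structure are hypothetical assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of joint text-image
# conditioning for a video diffusion block: reference-image tokens and
# text tokens are fused into one cross-attention context, so the
# denoiser attends to both modalities simultaneously. All names and
# dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class JointTextImageCrossAttention(nn.Module):
    def __init__(self, latent_dim: int, text_dim: int, image_dim: int,
                 num_heads: int = 8):
        super().__init__()
        # Project both prompt modalities into a shared context space.
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, video_tokens, text_emb, image_emb):
        # video_tokens: (B, N_video, latent_dim) flattened spatiotemporal latents
        # text_emb:     (B, N_text, text_dim)    encoded textual instruction
        # image_emb:    (B, N_img, image_dim)    encoded reference-subject image(s)
        context = torch.cat(
            [self.text_proj(text_emb), self.image_proj(image_emb)], dim=1
        )
        # Video latents query the joint text+image context.
        attn_out, _ = self.attn(video_tokens, context, context)
        return self.norm(video_tokens + attn_out)


if __name__ == "__main__":
    block = JointTextImageCrossAttention(latent_dim=320, text_dim=768,
                                         image_dim=1024)
    video = torch.randn(2, 1024, 320)   # batch of flattened video latents
    text = torch.randn(2, 77, 768)      # e.g. CLIP-style text tokens
    image = torch.randn(2, 257, 1024)   # e.g. ViT patch tokens of a reference
    out = block(video, text, image)
    print(out.shape)  # torch.Size([2, 1024, 320])
```

Multi-subject references would simply contribute additional image tokens to the same concatenated context, which matches the unified single- and multi-subject framing in the abstract.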

@article{liu2025_2502.11079,
  title={Phantom: Subject-consistent video generation via cross-modal alignment},
  author={Lijie Liu and Tianxiang Ma and Bingchuan Li and Zhuowei Chen and Jiawei Liu and Gen Li and Siyu Zhou and Qian He and Xinglong Wu},
  journal={arXiv preprint arXiv:2502.11079},
  year={2025}
}