ProPhy: Progressive Physical Alignment for Dynamic World Simulation

5 December 2025

Zijun Wang

Panwen Hu

Jing Wang

Terry Jingchen Zhang

Yuhao Cheng

Long Chen

Yiqiang Yan

Zutao Jiang

Hanhui Li

Xiaodan Liang

VGen

Main: 8 Pages

15 Figures

Bibliography: 2 Pages

4 Tables

Appendix: 6 Pages

Abstract

Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
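To make the two-stage routing idea concrete, the following is a minimal, illustrative sketch of a mixture-of-experts arrangement in which a prompt-level stage produces one shared physics prior and a token-level stage then routes each video token separately. It is not the authors' implementation: the abstract does not specify the gating functions, expert architectures, or how priors are injected, so every name here (`TwoStageMoPE`, `semantic_stage`, `refinement_stage`), the use of plain learned vectors as "experts", the softmax gating, and the additive conditioning are assumptions made purely for illustration. The VLM-based alignment strategy is not modeled.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mix(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TwoStageMoPE:
    """Hypothetical two-stage mixture of experts (illustration only).

    Assumption: each expert is a single vector and gates are linear + softmax.
    A real model would use learned networks trained end to end.
    """

    def __init__(self, dim, n_semantic, n_refine, seed=0):
        rng = random.Random(seed)
        self.sem_gate = rand_matrix(n_semantic, dim, rng)
        self.sem_experts = rand_matrix(n_semantic, dim, rng)
        self.ref_gate = rand_matrix(n_refine, dim, rng)
        self.ref_experts = rand_matrix(n_refine, dim, rng)

    def semantic_stage(self, text_emb):
        # One routing decision per prompt: a single prior shared by all tokens.
        w = softmax([dot(g, text_emb) for g in self.sem_gate])
        return w, mix(w, self.sem_experts)

    def refinement_stage(self, tokens, prior):
        # One routing decision per token: different regions can receive
        # different expert mixtures (the "anisotropic" part of the sketch).
        outputs, weights = [], []
        for tok in tokens:
            cond = [t + p for t, p in zip(tok, prior)]
            w = softmax([dot(g, cond) for g in self.ref_gate])
            delta = mix(w, self.ref_experts)
            outputs.append([c + d for c, d in zip(cond, delta)])
            weights.append(w)
        return outputs, weights

    def forward(self, text_emb, tokens):
        sem_w, prior = self.semantic_stage(text_emb)
        outputs, ref_w = self.refinement_stage(tokens, prior)
        return outputs, sem_w, ref_w


if __name__ == "__main__":
    rng = random.Random(1)
    model = TwoStageMoPE(dim=4, n_semantic=3, n_refine=2)
    text_emb = [rng.gauss(0.0, 1.0) for _ in range(4)]
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
    outputs, sem_w, ref_w = model.forward(text_emb, tokens)
    print("semantic routing weights:", sem_w)
    print("per-token refinement weights:", ref_w)
```

In this sketch the semantic stage yields one weight vector for the whole prompt, while the refinement stage yields one weight vector per token, which is the contrast between prompt-level and token-level conditioning that the abstract describes.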
