Sora as a World Model? A Complete Survey on Text-to-Video Generation
Noor Ul Eman
Jingyao Zheng
Sheng Zheng
Lik-Hang Lee
Caiyan Qin
Tae-Ho Kim
Choong Seon Hong
Yang Yang
Heng Tao Shen
Main: 21 pages, 7 figures, 2 tables; Bibliography: 14 pages
Abstract
The evolution of text-to-video generation, from animating MNIST digits to simulating the world with Sora, has progressed at breakneck speed. Here, we systematically discuss how far text-to-video generation technology supports the essential requirements of world modeling. We curate 250+ studies on text-based video synthesis and world modeling, and observe that recent models increasingly support spatial, action, and strategic intelligence in world modeling through adherence to completeness, consistency, and invention, as well as human interaction and control. We conclude that text-to-video generation is adept at world modeling, although open challenges in several aspects, such as the diversity-consistency trade-off, remain to be addressed.
