
Sora as a World Model? A Complete Survey on Text-to-Video Generation

Noor Ul Eman
Jingyao Zheng
Sheng Zheng
Lik-Hang Lee
Caiyan Qin
Tae-Ho Kim
Choong Seon Hong
Yang Yang
Heng Tao Shen
Main: 21 pages, 7 figures, 2 tables
Bibliography: 14 pages
Abstract

Text-to-video generation, from animating MNIST to simulating the world with Sora, has progressed at breakneck speed. Here, we systematically discuss how far text-to-video generation technology supports the essential requirements of world modeling. We curate 250+ studies on text-based video synthesis and world modeling. We then observe that recent models increasingly support spatial, action, and strategic intelligence in world modeling through adherence to completeness, consistency, and invention, as well as human interaction and control. We conclude that text-to-video generation is adept at world modeling, although work remains in several aspects, such as the diversity-consistency trade-off.
