arXiv: 2507.19468

Back to the Features: DINO as a Foundation for Video World Models

25 July 2025
Federico Baldassarre
Marc Szafraniec
Basile Terver
Vasil Khalidov
Francisco Massa
Yann LeCun
Patrick Labatut
Maximilian Seitzer
Piotr Bojanowski
Main: 9 pages, Bibliography: 4 pages, Appendix: 11 pages
7 figures, 10 tables
Abstract

We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, such as segmentation and depth forecasting, and demonstrates a strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
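
To make the last sentence of the abstract concrete, here is a minimal sketch of planning by simulating candidate trajectories in a latent space. It is not the paper's method: the random-shooting planner, the function names (plan_random_shooting, toy_predictor), the 2-dimensional latent, and the additive toy dynamics are all assumptions chosen for illustration. In a real setting the latent would come from a frozen image encoder and the predictor would be a learned action-conditioned model.

```python
import numpy as np


def plan_random_shooting(z0, z_goal, predictor, horizon=5,
                         n_candidates=256, action_dim=2, seed=0):
    """Pick the action sequence whose simulated final latent is closest to z_goal.

    `predictor(z, a)` is a hypothetical action-conditioned latent dynamics
    function mapping a batch of latents and actions to next latents.
    """
    rng = np.random.default_rng(seed)
    # Sample candidate action sequences uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    # Start every candidate from the same initial latent.
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    # Roll all candidates forward in latent space, one step at a time.
    for t in range(horizon):
        z = predictor(z, actions[:, t])
    # Score each candidate by distance between its final latent and the goal.
    costs = np.linalg.norm(z - z_goal[None, :], axis=1)
    best = int(np.argmin(costs))
    return actions[best], float(costs[best])


def toy_predictor(z, a):
    # Assumed stand-in dynamics: the latent shifts by a scaled action.
    return z + 0.5 * a


z0 = np.zeros(2)
z_goal = np.array([1.0, -1.0])
best_actions, best_cost = plan_random_shooting(z0, z_goal, toy_predictor)
print(best_actions.shape, best_cost)
```

Random shooting is used here only because it is the simplest planner that fits the description; any sampling-based or gradient-based optimizer over action sequences could take its place, with the cost computed in the same latent space.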
