
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

14 October 2025
arXiv:2510.12796
Yingyan Li
Shuyao Shang
Weisong Liu
Bing Zhan
Haochen Wang
Yuqi Wang
Yuntao Chen
X. Wang
Yasong An
Chufeng Tang
Lu Hou
Lue Fan
Zhaoxiang Zhang
Topics: VLM
Links: arXiv (abs) · PDF · HTML · HuggingFace (11 upvotes) · GitHub (10★)
Length: Main 9 pages · Bibliography 5 pages · Appendix 7 pages · 12 figures · 9 tables
Abstract

Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
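The abstract's core idea is a joint objective: the sparse action supervision is augmented with a dense world-modeling loss that predicts the next frame's visual tokens. The PyTorch sketch below is only an illustration of that idea for the autoregressive (discrete visual token) archetype, under stated assumptions; the module names (WorldModelVLA, world_head, action_expert), dimensions, loss choices, and weighting are hypothetical and not the authors' implementation.

```python
import torch
import torch.nn as nn

class WorldModelVLA(nn.Module):
    """Hypothetical sketch: a VLA backbone with a next-frame world-modeling
    head (dense self-supervision) alongside a lightweight action expert
    (sparse action supervision)."""

    def __init__(self, d_model=512, vocab_size=8192, action_dim=3):
        super().__init__()
        # Placeholder transformer standing in for a pretrained VLM backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Autoregressive world-model head: predicts discrete visual tokens
        # of the next frame (the dense supervision signal).
        self.world_head = nn.Linear(d_model, vocab_size)
        # Lightweight action expert: maps pooled features to a driving action.
        self.action_expert = nn.Sequential(
            nn.Linear(d_model, 256), nn.GELU(), nn.Linear(256, action_dim)
        )

    def forward(self, visual_embeds):
        h = self.backbone(visual_embeds)            # (B, T, d_model)
        next_token_logits = self.world_head(h)      # (B, T, vocab_size)
        action = self.action_expert(h.mean(dim=1))  # (B, action_dim)
        return next_token_logits, action


def training_step(model, visual_embeds, next_frame_tokens, gt_action, w_world=1.0):
    """One step of the joint objective: sparse action loss plus dense
    next-frame token prediction loss (assumed weighting)."""
    logits, action = model(visual_embeds)
    action_loss = nn.functional.l1_loss(action, gt_action)
    world_loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), next_frame_tokens.flatten()
    )
    return action_loss + w_world * world_loss


# Toy usage with random tensors (shapes only, not real driving data).
model = WorldModelVLA()
B, T = 2, 16
loss = training_step(
    model,
    visual_embeds=torch.randn(B, T, 512),
    next_frame_tokens=torch.randint(0, 8192, (B, T)),
    gt_action=torch.randn(B, 3),
)
loss.backward()
```

The diffusion variant described in the abstract would replace the token cross-entropy term with a denoising objective over continuous visual features; the overall joint-training structure stays the same.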
