Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments

8 May 2025
Pranav Guruprasad
Yangyue Wang
Sudipta Chowdhury
Harshvardhan Sikka
Topics: LM&Ro, VLM
Abstract

Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLM and VLA models, including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST, on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperform other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering.
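
For context, below is a minimal sketch of what a zero-shot rollout on a single Procgen task might look like, assuming the open-source procgen package and the classic Gym API (gym < 0.26) it registers. The query_model function is a hypothetical placeholder for a VLM/VLA policy call; it is not the actual MultiNet evaluation harness.

# Minimal sketch of a zero-shot episode rollout on one Procgen task.
# Assumes the open-source `procgen` package and the classic Gym API
# (gym < 0.26) it registers. `query_model` is a hypothetical stand-in
# for a VLM/VLA policy call, not the actual MultiNet harness.
import gym

def query_model(observation):
    # Placeholder: a real evaluation would encode the 64x64 RGB frame,
    # prompt the model (e.g. GPT-4o or OpenVLA), and map its output to
    # one of Procgen's 15 discrete actions.
    return 0  # hypothetical default action

# num_levels=0 requests unlimited procedurally generated levels.
env = gym.make("procgen:procgen-coinrun-v0", num_levels=0, distribution_mode="easy")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = query_model(obs)
    obs, reward, done, info = env.step(action)  # classic 4-tuple Gym step
    total_reward += reward
print(f"Episode return: {total_reward}")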

@article{guruprasad2025_2505.05540,
  title={Benchmarking Vision, Language, \& Action Models in Procedurally Generated, Open Ended Action Environments},
  author={Pranav Guruprasad and Yangyue Wang and Sudipta Chowdhury and Harshvardhan Sikka},
  journal={arXiv preprint arXiv:2505.05540},
  year={2025}
}