ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning

25 May 2025
Tuan V. Vo, T. Nguyen, Khang Nguyen, Duy Ho Minh Nguyen, Minh Nhat Vu
arXiv:2505.19080
Main: 14 pages · 7 figures · 2 tables · Appendix: 1 page · Bibliography: 1 page
Abstract

Vision-Language-Action (VLA) models have attracted considerable attention from the research community for their strength in translating multimodal observations and linguistic instructions into robotic actions. Despite recent advancements, VLAs often overlook explicit reasoning and learn only the functional input-to-action mappings, omitting the logical steps that are crucial for interpretability and generalization in complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided rationales. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. We then use ReFineVLA to fine-tune pre-trained VLAs on the reasoning-enriched datasets, maintaining their inherent generalization abilities while boosting their reasoning capabilities. In addition, we visualize attention maps to analyze the alignment among visual attention, linguistic prompts, and the actions to be executed, showcasing ReFineVLA's ability to focus on task-relevant regions and actions. Through this analysis, we find that ReFineVLA-trained models exhibit a meaningful attention shift towards relevant objects, highlighting enhanced multimodal understanding and improved generalization. Evaluated across manipulation tasks, ReFineVLA outperforms state-of-the-art baselines: it achieves an average increase of 5.0% in success rate on SimplerEnv WidowX Robot tasks, and improves by an average of 8.6% in variant aggregation settings and 1.7% in visual matching settings on SimplerEnv Google Robot tasks. The source code will be publicly available.
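The abstract describes a two-stage recipe: first annotate demonstrations with teacher-generated rationales, then fine-tune the pre-trained VLA on the enriched data. The following is a minimal, runnable sketch of that shape only; every name in it (Episode, StubTeacher, StubVLA, the prompt wording, the joint rationale-plus-action loss) is a hypothetical stand-in, not the authors' implementation.

```python
# Hypothetical sketch of a teacher-guided, reasoning-aware fine-tuning loop.
# Stage 1: a teacher model annotates demonstrations with rationales.
# Stage 2: the VLA is fine-tuned on the rationale-enriched dataset.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    observation: str          # stand-in for a multimodal observation (e.g., camera frame)
    instruction: str          # linguistic instruction
    action: List[float]       # ground-truth robot action
    rationale: str = ""       # teacher-generated reasoning, filled in stage 1

class StubTeacher:
    """Stand-in for an expert vision-language teacher model."""
    def generate(self, observation: str, prompt: str) -> str:
        return f"Given {observation}: {prompt}"

class StubVLA:
    """Stand-in for a pre-trained vision-language-action policy."""
    def loss(self, ep: Episode) -> float:
        # Assumed joint objective: a language term on the rationale
        # plus an action-prediction term on the ground-truth action.
        return len(ep.rationale) * 1e-3 + sum(a * a for a in ep.action)
    def step(self, loss: float) -> None:
        pass  # a real implementation would apply a gradient update here

def augment_with_rationales(data: List[Episode], teacher: StubTeacher) -> List[Episode]:
    """Stage 1: annotate each demonstration with a teacher rationale."""
    for ep in data:
        prompt = f"Explain step by step why the action is correct for: '{ep.instruction}'."
        ep.rationale = teacher.generate(ep.observation, prompt)
    return data

def finetune(vla: StubVLA, data: List[Episode], epochs: int = 1) -> StubVLA:
    """Stage 2: fine-tune so the policy learns to reason about its
    actions rather than only mapping inputs to actions."""
    for _ in range(epochs):
        for ep in data:
            vla.step(vla.loss(ep))
    return vla

if __name__ == "__main__":
    data = [Episode("frame_0", "pick up the red block", [0.1, -0.2, 0.05])]
    policy = finetune(StubVLA(), augment_with_rationales(data, StubTeacher()))
```

The abstract also mentions attention-map visualization to check whether attention shifts toward task-relevant objects. A generic way to inspect this is to overlay patch-level attention weights on the input image; the 16x16 patch grid and tensor shapes below are assumptions for illustration, not the paper's setup.

```python
# Assumed setup: flat attention weights over a 16x16 patch grid,
# upsampled to image resolution and overlaid as a heatmap.
import numpy as np
import matplotlib.pyplot as plt

def show_attention(image: np.ndarray, attn: np.ndarray) -> None:
    """image: HxWx3 float array in [0, 1]; attn: (256,) patch attention weights."""
    grid = attn.reshape(16, 16)
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)  # normalize to [0, 1]
    # Upsample the patch grid to the image resolution via a Kronecker product.
    heat = np.kron(grid, np.ones((image.shape[0] // 16, image.shape[1] // 16)))
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.4)  # semi-transparent attention overlay
    plt.axis("off")
    plt.show()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    show_attention(rng.random((256, 256, 3)), rng.random(256))
```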
