Online RL Fine-tuning for Flow-based Vision-Language-Action Models
Main: 9 pages, 17 figures, 13 tables. Bibliography: 3 pages. Appendix: 12 pages.
Abstract
Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs remains challenging due to intractable action log-likelihoods from iterative denoising.
