Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection involved in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging because iterative denoising makes action log-likelihoods intractable.

$\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models