
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Abstract

Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings and lack the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they lack temporal planning and reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve those goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% on real-world manipulation tasks and 6% on simulation benchmarks. Project website: this https URL
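
The two-stage decoding described in the abstract (autoregressively generate image tokens for a future subgoal frame, then a short chunk of action tokens conditioned on that subgoal) can be sketched roughly as follows. This is a minimal illustration of the control flow only; the stubbed model, tokenizer vocabulary sizes, sequence lengths, and all function names below are hypothetical placeholders, not the authors' implementation.

"""Sketch of CoT-VLA-style inference: predict a visual subgoal, then actions.

Hedged illustration of the abstract's two-stage decoding; all names and
hyperparameters here are assumptions, not the released model's API.
"""
import numpy as np

IMAGE_TOKENS_PER_FRAME = 256   # assumed visual-tokenizer grid (e.g., 16x16)
ACTION_CHUNK_LEN = 8           # assumed short action horizon
ACTION_DIM = 7                 # e.g., 6-DoF end-effector delta + gripper


def sample_next_token(prefix_tokens: list[int], vocab_size: int) -> int:
    """Stand-in for one autoregressive decoding step of the VLA transformer."""
    rng = np.random.default_rng(len(prefix_tokens))  # deterministic stub
    return int(rng.integers(0, vocab_size))


def predict_visual_subgoal(prompt_tokens: list[int]) -> list[int]:
    """Stage 1: generate image tokens for a future frame (the visual goal)."""
    tokens = list(prompt_tokens)
    for _ in range(IMAGE_TOKENS_PER_FRAME):
        tokens.append(sample_next_token(tokens, vocab_size=8192))
    return tokens[len(prompt_tokens):]


def predict_action_chunk(prompt_tokens: list[int],
                         subgoal_tokens: list[int]) -> np.ndarray:
    """Stage 2: generate discretized action tokens conditioned on the subgoal,
    then de-tokenize them into a short sequence of continuous actions."""
    context = list(prompt_tokens) + list(subgoal_tokens)
    action_tokens = []
    for _ in range(ACTION_CHUNK_LEN * ACTION_DIM):
        action_tokens.append(
            sample_next_token(context + action_tokens, vocab_size=256))
    # Map each 8-bit action token back to a normalized value in [-1, 1].
    actions = np.array(action_tokens, dtype=np.float32) / 255.0 * 2.0 - 1.0
    return actions.reshape(ACTION_CHUNK_LEN, ACTION_DIM)


if __name__ == "__main__":
    # Prompt = tokenized language instruction + tokenized current observation
    # (both stubbed here as arbitrary token ids).
    prompt = list(range(64))
    subgoal = predict_visual_subgoal(prompt)          # visual chain-of-thought
    actions = predict_action_chunk(prompt, subgoal)   # actions toward the goal
    print(actions.shape)                              # (8, 7)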

@article{zhao2025_2503.22020,
  title={CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models},
  author={Qingqing Zhao and Yao Lu and Moo Jin Kim and Zipeng Fu and Zhuoyang Zhang and Yecheng Wu and Zhaoshuo Li and Qianli Ma and Song Han and Chelsea Finn and Ankur Handa and Ming-Yu Liu and Donglai Xiang and Gordon Wetzstein and Tsung-Yi Lin},
  journal={arXiv preprint arXiv:2503.22020},
  year={2025}
}