VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
Main: 10 pages · Bibliography: 3 pages · Appendix: 9 pages · 4 figures · 3 tables
Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs), enabling long chains of thought, self-correction, and effective tool use. Although recent work attempts to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the model's response. In contrast, test-time methods such as Visual Sketchpad incorporate visual reasoning steps but lack a mechanism to train this behavior.
