Learning to Think Fast and Slow for Visual Language Models
When faced with complex problems, we tend to engage in slower, more deliberate thinking. In contrast, for simple questions we give quick, intuitive responses. This dual-system thinking approach allows us to allocate cognitive resources efficiently, reserving deeper analytical effort for tasks that truly require it. However, existing reasoning-oriented visual language models (VLMs) are mostly trained to generate uniformly long reasoning, leading to substantial token waste when concise answers would suffice. In this paper, we observe that pre-trained, general-purpose VLMs manifest variations in response length across question types, e.g., longer reasoning for math questions but shorter responses for perception problems. Unlike existing work that overrides this prior by stimulating long reasoning regardless of problem complexity, we propose to leverage this prior to develop an explicit dual-mode thinking mechanism. Specifically, we anchor each training instance to either a fast- or slow-thinking prefix consistent with the model's natural response-length tendency. Then, GRPO is adapted to learn dual-system thinking, enabling both autonomous and manual thinking-mode selection. Extensive experiments across a wide variety of visual reasoning benchmarks demonstrate that our model, named DualMindVLM, significantly outperforms the base model and achieves state-of-the-art reasoning performance while maintaining high token efficiency.
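The prefix-anchoring step described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the prefix strings, the token threshold, and the function name are all assumptions; the abstract only states that each instance is anchored to a fast or slow thinking prefix consistent with the base model's natural response length.

```python
# Hypothetical sketch of anchoring training instances to a thinking mode
# based on the base model's natural response length. Prefix markers and the
# threshold value are illustrative assumptions, not the paper's actual choices.

FAST_PREFIX = "<think_fast>"   # assumed marker for quick, intuitive answers
SLOW_PREFIX = "<think_slow>"   # assumed marker for deliberate reasoning

def anchor_thinking_mode(questions, base_response_lengths, threshold=200):
    """Tag each question with a thinking-mode prefix.

    questions: list of question strings
    base_response_lengths: tokens the base model spent answering each question
    threshold: token count separating "fast" from "slow" (assumed value)
    """
    anchored = []
    for question, length in zip(questions, base_response_lengths):
        prefix = SLOW_PREFIX if length > threshold else FAST_PREFIX
        anchored.append(prefix + " " + question)
    return anchored

# Example: a perception question the base model answered tersely vs. a math
# question that elicited a long chain of thought.
data = anchor_thinking_mode(
    ["What color is the car?", "Solve for x: 2x + 3 = 11."],
    [35, 410],
)
```

The anchored instances would then serve as training data for the GRPO-based stage, so that the learned policy associates each prefix with the corresponding response style.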