Testing the limits of fine-tuning to improve reasoning in vision language models

24 February 2025

Luca M. Schulze Buschoff

Konstantinos Voudouris

Abstract

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains in a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Fine-tuning can also improve model alignment with human behavior. However, we find that it does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

@article{buschoff2025_2502.15678,
  title={Testing the limits of fine-tuning to improve reasoning in vision language models},
  author={Luca M. Schulze Buschoff and Konstantinos Voudouris and Elif Akata and Matthias Bethge and Joshua B. Tenenbaum and Eric Schulz},
  journal={arXiv preprint arXiv:2502.15678},
  year={2025}
}
