ResearchTrend.AI
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem

31 October 2024
Declan Campbell
Sunayana Rane
Tyler Giallanza
Nicolò De Sabbata
Kia Ghods
Amogh Joshi
Alexander Ku
Steven M. Frankland
Thomas L. Griffiths
Jonathan D. Cohen
Taylor W. Webb
Abstract

Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models can describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience: a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising from the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.

@article{campbell2025_2411.00238,
  title={Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem},
  author={Declan Campbell and Sunayana Rane and Tyler Giallanza and Nicolò De Sabbata and Kia Ghods and Amogh Joshi and Alexander Ku and Steven M. Frankland and Thomas L. Griffiths and Jonathan D. Cohen and Taylor W. Webb},
  journal={arXiv preprint arXiv:2411.00238},
  year={2025}
}