Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

18 July 2024
Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
Abstract

Large Multimodal Models (LMMs) have made significant strides in visual question answering for single images. Recent advancements, such as long-context LMMs, allow them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album search or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, "Visual Haystacks (VHs)". We comprehensively evaluate both open-source and proprietary models on VHs and demonstrate that these models struggle to reason across potentially unrelated images, perform poorly on cross-image reasoning, and exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to a 13% performance improvement over existing open-source LMMs on VHs, sets a new state of the art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA against state-of-the-art LMMs. Our dataset, model, and code are available at: this https URL.
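The abstract names the key idea behind MIRAGE, visual retrieval-augmented generation, without implementation details. The sketch below illustrates the general pattern under stated assumptions: embed every image in the haystack once, retrieve only the images most relevant to the question, and pass just those to the answering model. The CLIP retriever, the checkpoint name, the top_k value, and the answer_with_lmm stub are illustrative assumptions, not components taken from the paper.

# Minimal visual-RAG sketch: find the few relevant "needles" in a large image
# haystack before querying an LMM. CLIP is an assumed stand-in retriever; the
# paper's MIRAGE framework may use a different encoder and pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths, batch_size=32):
    """Embed the whole haystack once; the index is reused across questions."""
    feats = []
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=images, return_tensors="pt")
        f = model.get_image_features(**inputs)
        feats.append(f / f.norm(dim=-1, keepdim=True))  # unit-normalize
    return torch.cat(feats)

@torch.no_grad()
def retrieve(question, image_feats, top_k=5):
    """Rank haystack images by cosine similarity to the question text."""
    inputs = processor(text=[question], return_tensors="pt")
    q = model.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (image_feats @ q.T).squeeze(-1)
    return scores.topk(min(top_k, len(scores))).indices.tolist()

# Usage (answer_with_lmm is a hypothetical stub for the downstream LMM):
# image_feats = embed_images(all_paths)
# idx = retrieve("Is there a dog near the red truck?", image_feats)
# answer = answer_with_lmm(question, [all_paths[i] for i in idx])

Because the answering model only ever sees the top_k retrieved images, the context it must process stays constant as the haystack grows, which is the property that lets a retrieval-first pipeline scale to thousands of images where a pure long-context approach runs out of memory.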

@article{wu2025_2407.13766,
  title={Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark},
  author={Tsung-Han Wu and Giscard Biamby and Jerome Quenum and Ritwik Gupta and Joseph E. Gonzalez and Trevor Darrell and David M. Chan},
  journal={arXiv preprint arXiv:2407.13766},
  year={2025}
}