ResearchTrend.AI

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

21 April 2025
David Ma
Yuanxing Zhang
Jincheng Ren
Jarvis Guo
Yifan Yao
Zhenlin Wei
Zhenzhu Yang
Zhongyuan Peng
Boyu Feng
Jun Ma
Xiao Gu
Zhoufutu Wen
King Zhu
Yancheng He
Meng Cao
Shiwen Ni
Jiaheng Liu
Wenhao Huang
Ge Zhang
Xiaojie Jin
Abstract

Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries spanning 13 tasks (7 perception and 6 reasoning) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash, and Gemini2-Pro) MLLMs show that current models substantially underperform on image-grounded video perception and reasoning, achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning data formats during training. These findings collectively provide valuable insights for future research. Our code and data are released at this https URL.

@article{ma2025_2504.15415,
  title={IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs},
  author={David Ma and Yuanxing Zhang and Jincheng Ren and Jarvis Guo and Yifan Yao and Zhenlin Wei and Zhenzhu Yang and Zhongyuan Peng and Boyu Feng and Jun Ma and Xiao Gu and Zhoufutu Wen and King Zhu and Yancheng He and Meng Cao and Shiwen Ni and Jiaheng Liu and Wenhao Huang and Ge Zhang and Xiaojie Jin},
  journal={arXiv preprint arXiv:2504.15415},
  year={2025}
}