InstructionBench: An Instructional Video Understanding Benchmark

7 April 2025
Haiwan Wei
Yitian Yuan
Xiaohan Lan
Wei Ke
Lin Ma
Abstract

Despite progress in video large language models (Video-LLMs), research on instructional video understanding, crucial for enhancing access to instructional content, remains insufficient. To address this, we introduce InstructionBench, an Instructional video understanding Benchmark, which challenges models' advanced temporal reasoning within instructional videos characterized by their strict step-by-step flow. Employing GPT-4, we formulate Q&A pairs in open-ended and multiple-choice formats to assess both Coarse-Grained event-level and Fine-Grained object-level reasoning. Our filtering strategies exclude questions answerable purely by common-sense knowledge, focusing the evaluation of Video-LLMs on visual perception and analysis. The final benchmark contains 5k questions across over 700 videos. We evaluate the latest Video-LLMs on InstructionBench and find that closed-source models outperform open-source ones. However, even the best model, GPT-4o, achieves only 53.42% accuracy, indicating significant gaps in temporal reasoning. To advance the field, we also develop a comprehensive instructional video dataset with over 19k Q&A pairs from nearly 2.5k videos, built with an automated data generation framework, thereby enriching the community's research resources.
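The abstract reports accuracy on Q&A pairs in open-ended and multiple-choice formats. As a minimal sketch of how multiple-choice accuracy is typically computed for such a benchmark (this is not the authors' released code; the file layout and field names "question", "options", "answer", and "prediction" are assumptions for illustration):

```python
# Minimal sketch (not the authors' code): scoring multiple-choice answers
# for a Video-LLM benchmark stored as one JSON record per line.
import json
import re


def extract_choice(prediction: str) -> str | None:
    """Pull the first option letter (A-D) out of a free-form model reply."""
    match = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return match.group(1) if match else None


def accuracy(records: list[dict]) -> float:
    """Fraction of records whose predicted letter matches the ground truth."""
    correct = sum(
        1 for r in records
        if extract_choice(r["prediction"]) == r["answer"].strip().upper()
    )
    return correct / len(records) if records else 0.0


if __name__ == "__main__":
    # Hypothetical file name and layout: one JSON object per question.
    with open("instructionbench_predictions.jsonl") as f:
        records = [json.loads(line) for line in f]
    print(f"Multiple-choice accuracy: {accuracy(records):.2%}")
```

Open-ended answers are usually scored differently (e.g., by an LLM judge or string matching); the abstract does not specify the protocol, so only the multiple-choice case is sketched here.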

@article{wei2025_2504.05040,
  title={InstructionBench: An Instructional Video Understanding Benchmark},
  author={Haiwan Wei and Yitian Yuan and Xiaohan Lan and Wei Ke and Lin Ma},
  journal={arXiv preprint arXiv:2504.05040},
  year={2025}
}