On the Consistency of Video Large Language Models in Temporal Comprehension

20 November 2024
Minjoon Jung
Junbin Xiao
Byoung-Tak Zhang
Angela Yao
Abstract

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well studied nor well understood. We therefore conduct a study on prediction consistency -- a key indicator of robustness and trustworthiness in temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check whether the model's responses align with this initial grounding, as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video content, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning, which explicitly accounts for consistency, and demonstrate significant improvements in both grounding and consistency. Our data and code are open-sourced at this https URL.
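To make the consistency notion concrete, here is a minimal illustrative sketch (not the authors' code; the threshold of 0.5 is an assumption): a model first grounds a query to a moment `(start, end)`, and after a probe (e.g., a paraphrased query or a shifted clip) we check whether the new prediction still overlaps the initial one via temporal IoU.

```python
# Illustrative sketch of a prediction-consistency check for temporal
# grounding. Moments are (start, end) intervals in seconds; the 0.5
# IoU threshold is a hypothetical choice, not from the paper.

def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_consistent(initial, probed, threshold=0.5):
    """Treat a probed re-grounding as consistent if IoU >= threshold."""
    return temporal_iou(initial, probed) >= threshold

# Example: initial grounding vs. prediction after a paraphrased query.
iou = temporal_iou((10.0, 20.0), (12.0, 22.0))
print(round(iou, 3))                              # 0.667
print(is_consistent((10.0, 20.0), (12.0, 22.0)))  # True
```

Averaging such a binary consistency signal over many probes of one kind (video shifts, query rephrasings, task reformulations) gives a per-probe consistency rate of the sort the study measures.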

@article{jung2025_2411.12951,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Minjoon Jung and Junbin Xiao and Byoung-Tak Zhang and Angela Yao},
  journal={arXiv preprint arXiv:2411.12951},
  year={2025}
}