ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

21 May 2025

Tony Montes

Fernando Lozano

Main: 8 Pages

10 Figures

Bibliography: 3 Pages

18 Tables

Appendix: 13 Pages

Abstract

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects over time for grounding and in reasoning-based decision-making that aligns object references with language model outputs, especially as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.
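
The abstract does not spell out how grounding timeframes are cross-checked, so the following is only an illustrative sketch of one plausible reading: timeframes claimed by the reasoning agent are compared against timeframes in which an open-vocabulary detector actually observed the object, using temporal intersection-over-union. The names `temporal_iou` and `cross_check`, the interval representation, and the 0.5 threshold are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# An interval is (start_sec, end_sec); this representation is an assumption.
Interval = Tuple[float, float]


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def cross_check(
    claimed: List[Interval],
    detected: List[Interval],
    threshold: float = 0.5,  # hypothetical acceptance threshold
) -> List[bool]:
    """For each claimed timeframe, report whether any detected timeframe
    overlaps it with temporal IoU at or above the threshold."""
    return [
        any(temporal_iou(c, d) >= threshold for d in detected)
        for c in claimed
    ]


# Timeframes the agent claims contain the object (hypothetical values).
claimed = [(2.0, 6.0), (10.0, 12.0)]
# Timeframes where a detector reported the object (hypothetical values).
detected = [(1.0, 5.0), (20.0, 25.0)]

print(cross_check(claimed, detected))  # [True, False]
```

In this sketch, a claimed timeframe with no matching detection would be flagged as unverified, which is one way the verification support mentioned in the abstract could be surfaced to a user.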
