
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Computer Vision and Pattern Recognition (CVPR), 2025

12 March 2025

Kevin Qinghong Lin

Mike Zheng Shou

VGen


Main: 10 Pages

13 Figures

Bibliography: 3 Pages

6 Tables

Appendix: 1 Page

Abstract

Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) a generative retrieval model that marries the language model's complex reasoning capabilities with contrastive retrieval's flexible upgrading over the narration vocabulary; (ii) a hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand); and (iii) a vocabulary update strategy that leverages generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Code is released at this https URL.
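
The hierarchical indexing idea in (ii), first identifying a broad scenario and then a specific event within it, can be illustrated with a toy two-stage lookup. This is only a minimal sketch under stated assumptions, not the paper's method: the vocabulary contents, the bag-of-words `embed` function, and the `retrieve` helper are all hypothetical stand-ins, and nothing here reproduces the narration pair encoding algorithm or the GPT-2-based model.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary (assumption): scenario prefix -> event narrations.
VOCAB = {
    "kitchen": ["cutting a tomato", "washing a pan", "opening the fridge"],
    "bedroom": ["turning off an alarm", "making the bed", "opening the closet"],
}


def embed(text):
    """Bag-of-words counts, standing in for a learned embedding (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scenario_embedding(name):
    """Represent a scenario by its name plus all of its event narrations."""
    return embed(" ".join([name] + VOCAB[name]))


def retrieve(query):
    """Two-stage lookup: pick the closest scenario, then the closest event in it."""
    q = embed(query)
    scenario = max(VOCAB, key=lambda s: cosine(q, scenario_embedding(s)))
    event = max(VOCAB[scenario], key=lambda e: cosine(q, embed(e)))
    return scenario, event


if __name__ == "__main__":
    print(retrieve("person cutting a tomato in the kitchen"))
    print(retrieve("turning off an alarm"))
```

Narrowing to one scenario first means the second stage compares the query against only that scenario's events rather than the whole vocabulary, which is the efficiency argument the abstract makes for a hierarchical vocabulary.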
