FOCUS: Efficient Keyframe Selection for Long Video Understanding

Main: 8 pages, 4 figures, 10 tables. Bibliography: 7 pages. Appendix: 9 pages.

Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either subsample frames uniformly or apply keyframe selection with retrieval-style scoring from smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering to reduce inference cost, and can miss the most informative moments.
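
To make the two baseline strategies named above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method proposed in this paper: the function names are hypothetical, and the per-frame relevance scores are assumed to come from some smaller vision-language model that is not shown.

```python
def uniform_subsample(num_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices evenly spaced over the video."""
    if budget <= 0:
        return []
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the centre of each of the `budget` equal-length segments.
    return [int(step * i + step / 2) for i in range(budget)]


def topk_keyframes(scores: list[float], budget: int) -> list[int]:
    """Pick the `budget` highest-scoring frames, returned in temporal order.

    `scores` are hypothetical query-frame relevance values (assumption:
    produced elsewhere by a smaller vision-language model).
    """
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: six frames, a budget of two.
toy_scores = [0.1, 0.9, 0.85, 0.2, 0.05, 0.7]
uniform_pick = uniform_subsample(len(toy_scores), 2)
scored_pick = topk_keyframes(toy_scores, 2)
```

In this toy case uniform subsampling ignores the scores entirely, while score-based selection needs a score for every frame, which is the inference cost that pre-filtering is meant to reduce.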
