FOCUS: Efficient Keyframe Selection for Long Video Understanding

31 October 2025

Zirui Zhu

Main: 8 pages; Bibliography: 7 pages; Appendix: 9 pages; 4 figures; 10 tables

Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either subsample frames uniformly or select keyframes with retrieval-style scoring from smaller vision-language models. Yet these keyframe selection methods still pre-filter frames before selection to reduce inference cost, and can miss the most informative moments.
