All Papers

Multimodal large language models (MLLMs) demonstrate exceptional performance on vision-language tasks, yet their processing of long videos is constrained by input context length and high computational cost. Sparse frame sampling therefore becomes a necessary preprocessing step, and the quality of the sampled frames directly impacts downstream performance. Existing keyframe search algorithms balance efficiency and sampled-frame quality but rely heavily on the visual modality alone, which makes them difficult to adapt to text-related tasks and often causes retrieval results to deviate from the core semantic content. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach, combining Video Search and Subtitle Match, to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance on text-related tasks and exhibiting strong generalization across other tasks.
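To make the dual-branch idea concrete, the sketch below scores frames with a visual branch (embedding similarity to the query) and a subtitle branch (lexical overlap with time-aligned subtitles), then fuses the two scores to pick the top-k keyframes. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the token-overlap scorer, and the fusion weight `alpha` are all hypothetical.

```python
# Illustrative dual-branch keyframe retrieval: fuse a visual-similarity score
# with a subtitle-match score and return the top-k frame indices.
# (Sketch only; names and the fusion rule are assumptions, not the VSI method.)
from dataclasses import dataclass
import numpy as np


@dataclass
class SubtitleCue:
    text: str        # subtitle line
    frame_idx: int   # frame index this cue is time-aligned to


def visual_scores(query_emb: np.ndarray, frame_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query embedding and each frame embedding."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    f = frame_embs / (np.linalg.norm(frame_embs, axis=1, keepdims=True) + 1e-8)
    return f @ q


def subtitle_scores(query: str, cues: list[SubtitleCue], num_frames: int) -> np.ndarray:
    """Simple token-overlap score between the query and the subtitle at each frame."""
    scores = np.zeros(num_frames)
    q_tokens = set(query.lower().split())
    for cue in cues:
        overlap = len(q_tokens & set(cue.text.lower().split()))
        if overlap:
            scores[cue.frame_idx] = max(scores[cue.frame_idx],
                                        overlap / max(len(q_tokens), 1))
    return scores


def retrieve_keyframes(query: str, query_emb: np.ndarray, frame_embs: np.ndarray,
                       cues: list[SubtitleCue], k: int = 8, alpha: float = 0.5) -> list[int]:
    """Fuse the two branches with weight alpha and return top-k frame indices."""
    v = visual_scores(query_emb, frame_embs)
    t = subtitle_scores(query, cues, len(frame_embs))
    fused = alpha * v + (1 - alpha) * t
    return np.argsort(-fused)[:k].tolist()
```

In practice the frame and query embeddings would come from a pretrained vision-language encoder, and the fusion could be learned rather than a fixed weighted sum; the point here is only that text-related queries can be localized through subtitles even when the visual branch alone is uninformative.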