EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models

6 March 2025

Abstract

The advent of Large Vision-Language Models (LVLMs) has advanced the video-based tasks, such as video captioning and video understanding. Some previous research indicates that taking texts in videos as input can further improve the performance of video understanding. As a type of indispensable information in short videos or movies, subtitles can assist LVLMs to better understand videos. Most existing methods for video subtitle extraction are based on a multi-stage framework, handling each frame independently. They can hardly exploit the temporal information of videos. Although some LVLMs exhibit the robust OCR capability, predicting accurate timestamps for subtitle texts is still challenging. In this paper, we propose an End-to-end Video Subtitle Extraction method, called EVE, which consists of three modules: a vision encoder, an adapter module, and a large language model. To effectively compress the visual tokens from the vision encoder, we propose a novel adapter InterleavedVT to interleave two modalities. It contains a visual compressor and a textual region compressor. The proposed InterleavedVT exploits both the merits of average pooling and Q-Former in token compression. Taking the temporal information of videos into account, we introduce a sliding-window mechanism in the textual region compressor. To benchmark the video subtitle extraction task, we propose a large dataset ViSa including 2.5M videos. Extensive experiments on ViSa demonstrate that the proposed EVE can outperform existing open-sourced tools and LVLMs.

View on arXiv

@article{yu2025_2503.04058,
  title={ EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models },
  author={ Haiyang Yu and Jinghui Lu and Yanjie Wang and Yang Li and Han Wang and Can Huang and Bin Li },
  journal={arXiv preprint arXiv:2503.04058},
  year={ 2025 }
}

Comments on this paper