LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

21 October 2022

Dongsheng Chen

Chaofan Tao

Lu Hou

Lifeng Shang

Xin Jiang

Qun Liu



Abstract

Recent large-scale video-language pre-trained models have shown strong performance on various downstream tasks. However, pre-training is computationally expensive because it requires millions of video-text pairs and because each video has a redundant data structure. To mitigate these problems, we propose LiteVL, which adapts the pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To supply the temporal modeling that the image-language model lacks, we propose adding temporal attention modules with dynamic temporal scaling to the image encoder of BLIP. Besides this model-wise adaptation, we also propose a non-parametric pooling mechanism that adaptively reweights the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that LiteVL outperforms previous video-language pre-trained models by a clear margin, even without any video-language pre-training.
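
The text-conditioned pooling idea can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function name `text_conditioned_pool`, the use of cosine similarity, the softmax weighting, and the `temperature` parameter are all hypothetical and may differ from what LiteVL actually does. The sketch only shows one way to reweight per-frame embeddings by their relevance to a text embedding without introducing learned parameters.

```python
import math


def text_conditioned_pool(frame_embs, text_emb, temperature=1.0):
    """Pool per-frame embeddings into one vector, weighting each frame by its
    similarity to the text embedding. No learned parameters are involved.

    Assumption: cosine similarity + softmax; the paper's formula may differ.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        # Guard against zero vectors to avoid division by zero.
        return math.sqrt(dot(a, a)) or 1.0

    # Cosine similarity between each frame embedding and the text embedding.
    text_norm = norm(text_emb)
    sims = [dot(f, text_emb) / (norm(f) * text_norm) for f in frame_embs]

    # Numerically stable softmax over the scaled similarities.
    scaled = [s / temperature for s in sims]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of frame embeddings.
    dim = len(text_emb)
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frame_embs)) for i in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # two toy frame embeddings
    text = [1.0, 0.0]                  # toy text embedding
    pooled, weights = text_conditioned_pool(frames, text, temperature=0.5)
    print(weights, pooled)
```

In this toy setup the frame aligned with the text receives the larger weight, and lowering `temperature` concentrates the weight on the best-matching frames.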
