
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

Abstract

Video understanding has improved remarkably in recent years, but this progress depends heavily on the availability of large-scale labeled datasets. Recent vision-language models, especially those based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled data. Adapting such models to video typically involves modifying the vision-language architecture to handle video data. However, this is not trivial: such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture, thereby preserving its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and lower computational cost. In particular, we use just 1/3 of the GFLOPs and 1/28 of the tunable parameters of the recent state-of-the-art, yet outperform it by up to 15.8% depending on the task and dataset.
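The abstract's core idea is to inject learnable temporal prompts while leaving CLIP's weights untouched. Below is a minimal, hypothetical PyTorch sketch of that general pattern only: a small frozen nn.TransformerEncoder stands in for CLIP's visual transformer (whose internals the abstract does not detail), and learnable per-frame prompt tokens are prepended to each frame's patch embeddings before the unmodified encoder forward pass. All names and shapes here (TemporalPromptWrapper, prompts_per_frame, the token layout) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalPromptWrapper(nn.Module):
    """Sketch of temporal visual prompting around a frozen encoder.

    The backbone below is a stand-in for CLIP's frozen ViT; only the
    temporal prompt tokens are trainable. Hypothetical, not TP-CLIP's code.
    """

    def __init__(self, embed_dim=512, num_frames=8, prompts_per_frame=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone stays frozen, as in the abstract
        # One small bank of learnable prompt vectors per temporal position.
        self.prompts = nn.Parameter(
            0.02 * torch.randn(num_frames, prompts_per_frame, embed_dim))

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, D) patch embeddings for T video frames.
        B, T, N, D = frame_tokens.shape
        p = self.prompts[:T].unsqueeze(0).expand(B, -1, -1, -1)
        x = torch.cat([p, frame_tokens], dim=2)  # prepend prompts per frame
        x = x.reshape(B * T, -1, D)              # fold time into the batch
        x = self.backbone(x)                     # unmodified encoder forward
        summary = x[:, 0]                        # first token as frame summary
        return summary.view(B, T, D).mean(dim=1)  # pool over time -> video feat

# Usage with dummy patch tokens: 2 clips, 8 frames, 49 patches of dim 512.
video_feat = TemporalPromptWrapper()(torch.randn(2, 8, 49, 512))  # (2, 512)
```

Because gradients flow only into the prompt bank, the tunable-parameter count stays tiny relative to full fine-tuning, which is consistent with the 1/28 parameter figure the abstract reports, though the paper's exact prompt design may differ.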

@article{gowda2025_2504.01890,
  title={Is Temporal Prompting All We Need For Limited Labeled Action Recognition?},
  author={Shreyank N Gowda and Boyan Gao and Xiao Gu and Xiaobo Jin},
  journal={arXiv preprint arXiv:2504.01890},
  year={2025}
}