CLIP4Caption: CLIP for Video Caption


13 October 2021
Mingkang Tang
Zhanyu Wang
Zhenhua Liu
Fengyun Rao
Dian Li
Xiu Li
    CLIP, VLM
arXiv: 2110.06615

Papers citing "CLIP4Caption: CLIP for Video Caption"

32 / 32 papers shown
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Min Yang
Zihan Jia
Zhilin Dai
Sheng Guo
Limin Wang
CLIP, VLM
30
0
0
10 Aug 2025
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Shaoan Xie
Lingjing Kong
Yujia Zheng
Yu Yao
Zeyu Tang
Eric Xing
Guangyi Chen
Kun Zhang
VLM
33
0
0
29 Jul 2025
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach
Youqi Wu
Jingwei Zhang
Farzan Farnia
85
1
0
10 Jun 2025
Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification
Shuang Li
Jiaxu Leng
Changjiang Kuang
Mingpi Tan
Xinbo Gao
114
2
0
03 Jun 2025
SPKLIP: Aligning Spike Video Streams with Natural Language
Yongchang Gao
Meiling Jin
Zhaofei Yu
Tiejun Huang
Guozhang Chen
CLIP, VLM
295
0
0
19 May 2025
Generative Modeling of Class Probability for Multi-Modal Representation Learning
Jungkyoo Shin
Bumsoo Kim
Eunwoo Kim
168
1
0
21 Mar 2025
MMRL: Multi-Modal Representation Learning for Vision-Language Models
Yuncheng Guo
Xiaodong Gu
VLM, OffRL
619
8
0
11 Mar 2025
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
Caihua Liu
Xu Li
Wenjing Xue
Wei Tang
Xia Feng
95
0
0
20 Feb 2025
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin
Haoyang Li
Li Yuan
Shuicheng Yan
Jie Chen
199
2
0
31 Dec 2024
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
Ehsan Faghihi
Mohammedreza Zarenejad
Ali-Asghar Beheshti Shirazi
119
1
0
04 Nov 2024
CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification
Shuang Li
Jiaxu Leng
Guozhang Li
Ji Gan
Haosheng Chen
Xinbo Gao
126
5
0
13 Jun 2024
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Angelos Zavras
Dimitrios Michail
Begüm Demir
Ioannis Papoutsis
VLM
174
14
0
15 Feb 2024
An Initial Exploration: Learning to Generate Realistic Audio for Silent Video
Matthew Martel
Jack Wagner
VGen
59
0
0
23 Aug 2023
ViCo: Engaging Video Comment Generation with Human Preference Rewards
Yuchong Sun
Bei Liu
Xu Chen
Ruihua Song
Jianlong Fu
VGen
67
2
0
22 Aug 2023
Open-Vocabulary Object Detection via Scene Graph Discovery
Hengcan Shi
Munawar Hayat
Jianfei Cai
ObjD
123
13
0
07 Jul 2023
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Willy Fitra Hendria
85
3
0
20 Jun 2023
MMNet: Multi-Mask Network for Referring Image Segmentation
Yimin Yan
Xingjian He
Wenxuan Wan
Qingbin Liu
EgoV
118
2
0
24 May 2023
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Runqi Wang
Hao Zheng
Xiaoyue Duan
Jianzhuang Liu
Yuning Lu
Tian Wang
Songcen Xu
Baochang Zhang
VLM
86
12
0
19 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Wen-tau Yih
VGen
83
3
0
04 May 2023
AutoAD: Movie Description in Context
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
101
41
0
29 Mar 2023
Improving Audio-Visual Video Parsing with Pseudo Visual Labels
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
115
17
0
04 Mar 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM, VLM, MoE
147
190
0
01 Feb 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM, AI4TS
249
82
0
30 Dec 2022
CLIP-Driven Fine-grained Text-Image Person Re-identification
Shuanglin Yan
Neng Dong
Liyan Zhang
Jinhui Tang
127
109
0
19 Oct 2022
REST: REtrieve & Self-Train for generative action recognition
Adrian Bulat
Enrique Sanchez
Brais Martínez
Georgios Tzimiropoulos
VLM
78
4
0
29 Sep 2022
Visual Subtitle Feature Enhanced Video Outline Generation
Qi Lv
Ziqiang Cao
Wenrui Xie
Derui Wang
Jingwen Wang
...
Yuan-Fang Li
Min Cao
Wenjie Li
Sujian Li
Guohong Fu
VGen
123
0
0
24 Aug 2022
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Yoad Tewel
Yoav Shalev
Roy Nadler
Idan Schwartz
Lior Wolf
88
30
0
22 Jul 2022
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Tal Shaharabany
Yoad Tewel
Lior Wolf
ObjD
114
19
0
19 Jun 2022
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo
Tong Wang
Jorma T. Laaksonen
VLM
84
33
0
01 Jun 2022
Unsupervised Prompt Learning for Vision-Language Models
Hao Huang
Jack Chu
Fangyun Wei
VPVLM, MLLM, VLM
180
142
0
07 Apr 2022
Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision
Yufeng Cui
Lichen Zhao
Feng Liang
Yangguang Li
Jing Shao
UQCV, VLM, CLIP
164
45
0
11 Mar 2022
CRIS: CLIP-Driven Referring Image Segmentation
Zhaoqing Wang
Yu Lu
Qiang Li
Xunqiang Tao
Yan Guo
Ming Gong
Tongliang Liu
VLM
217
397
0
30 Nov 2021