ResearchTrend.AI
  • Papers
  • Communities
  • Organizations
  • Events
  • Blog
  • Pricing
  • Feedback
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2109.14084
  4. Cited By
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text
  Understanding
v1v2 (latest)

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

28 September 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
    CLIPVLM
ArXiv (abs)PDFHTMLGithub (31473★)

Papers citing "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding"

50 / 439 papers shown
Title
In-Style: Bridging Text and Uncurated Videos with Style Transfer for
  Text-Video Retrieval
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
98
5
0
16 Sep 2023
Masked Diffusion with Task-awareness for Procedure Planning in
  Instructional Videos
Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos
Fen Fang
Yun Liu
Ali Koksal
Qianli Xu
Joo-Hwee Lim
VGenDiffM
115
6
0
14 Sep 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
184
76
0
04 Sep 2023
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute
  Decomposition-Aggregation
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation
Chaofan Ma
Yu-Hao Yang
Chen Ju
Fei Zhang
Ya Zhang
Yanfeng Wang
VLM
204
23
0
31 Aug 2023
Cross-Modal Retrieval Meets Inference:Improving Zero-Shot Classification
  with Cross-Modal Retrieval
Cross-Modal Retrieval Meets Inference:Improving Zero-Shot Classification with Cross-Modal Retrieval
Seong-Hoon Eom
Namgyu Ho
Jaehoon Oh
Se-Young Yun
CLIPVLM
91
1
0
29 Aug 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
144
33
0
28 Aug 2023
Language Reward Modulation for Pretraining Reinforcement Learning
Language Reward Modulation for Pretraining Reinforcement Learning
Ademi Adeniji
Amber Xie
Carmelo Sferrazza
Younggyo Seo
Stephen James
Pieter Abbeel
117
34
0
23 Aug 2023
Learning from Semantic Alignment between Unpaired Multiviews for
  Egocentric Video Recognition
Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition
Qitong Wang
Long Zhao
Liangzhe Yuan
Ting Liu
Xi Peng
171
17
0
22 Aug 2023
Opening the Vocabulary of Egocentric Actions
Opening the Vocabulary of Egocentric Actions
Dibyadip Chatterjee
Fadime Sener
Shugao Ma
Angela Yao
VLM
165
18
0
22 Aug 2023
VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video
  Anomaly Detection
VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection
Peng Wu
Xu Zhou
Guansong Pang
Lingru Zhou
Qingsen Yan
Peng Wang
Yanning Zhang
CLIPVLM
134
110
0
22 Aug 2023
UnLoc: A Unified Framework for Video Localization Tasks
UnLoc: A Unified Framework for Video Localization Tasks
Shengjia Yan
Xuehan Xiong
Arsha Nagrani
Anurag Arnab
Zhonghao Wang
Weina Ge
David A. Ross
Cordelia Schmid
148
64
0
21 Aug 2023
Long-range Multimodal Pretraining for Movie Understanding
Long-range Multimodal Pretraining for Movie Understanding
Dawit Mureja Argaw
Joon-Young Lee
Markus Woodson
In So Kweon
Fabian Caba Heilbron
VLM
93
10
0
18 Aug 2023
The Unreasonable Effectiveness of Large Language-Vision Models for
  Source-free Video Domain Adaptation
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation
Giacomo Zara
Alessandro Conti
Subhankar Roy
Stéphane Lathuilière
Paolo Rota
Elisa Ricci
108
14
0
17 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S.Torr
Xiaoping Zhang
Yansong Tang
151
25
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
96
27
0
15 Aug 2023
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
Chaorui Deng
Qi Chen
Pengda Qin
Dave Zhenyu Chen
Qi Wu
VLMCLIP
101
36
0
15 Aug 2023
Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
Yan Zhu
Junbao Zhuo
B. Ma
Jiajia Geng
Xiaoming Wei
Xiaolin K. Wei
Shuhui Wang
VLM
96
6
0
14 Aug 2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for
  Comprehensive 3D Representation
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Haowei Wang
Jiji Tang
Jiayi Ji
Xiaoshuai Sun
Rongsheng Zhang
...
Minda Zhao
Lincheng Li
zeng zhao
Tangjie Lv
Rongrong Ji
3DV
125
19
0
06 Aug 2023
Detecting Cloud Presence in Satellite Images Using the RGB-based CLIP
  Vision-Language Model
Detecting Cloud Presence in Satellite Images Using the RGB-based CLIP Vision-Language Model
Mikolaj Czerkawski
Robert C. Atkinson
Christos Tachtatzis
VLM
63
3
0
01 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video
  Understanding
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Lei Li
Gaoang Wang
VLMMLLM
254
358
0
31 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
J. Marescaux
Pietro Mascagni
Nassir Navab
N. Padoy
300
33
0
27 Jul 2023
Discovering Spatio-Temporal Rationales for Video Question Answering
Discovering Spatio-Temporal Rationales for Video Question Answering
Yicong Li
Junbin Xiao
Chun Feng
Xiang Wang
Tat-Seng Chua
142
15
0
22 Jul 2023
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Kumar Ashutosh
Santhosh Kumar Ramakrishnan
Triantafyllos Afouras
Kristen Grauman
175
27
0
17 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
  and Generation
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLMVGen
188
317
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the
  Backbone
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
176
111
0
11 Jul 2023
MotionGPT: Human Motion as a Foreign Language
MotionGPT: Human Motion as a Foreign Language
Biao Jiang
Xin Chen
Wen Liu
Jingyi Yu
Gang Yu
Tao Chen
MLLM
153
353
0
26 Jun 2023
A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step
  Inference
A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference
Chao Zhang
Shiwei Wu
Sirui Zhao
Tong Xu
Enhong Chen
75
0
0
26 Jun 2023
ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction
  with Multimodal Transformer
ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer
Jiaxin Deng
Dong Shen
Shiyao Wang
Xiangyu Wu
Fan Yang
Guorui Zhou
Gaofeng Meng
78
2
0
26 Jun 2023
Meta-Personalizing Vision-Language Models to Find Named Instances in
  Video
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
Chun-Hsiao Yeh
Bryan C. Russell
Josef Sivic
Fabian Caba Heilbron
Simon Jenni
VLMMLLM
119
12
0
16 Jun 2023
Vision-Language Models can Identify Distracted Driver Behavior from
  Naturalistic Videos
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Md Zahid Hasan
Jiajing Chen
Jiyang Wang
Mohammed Shaiqur Rahman
Ameya Joshi
Senem Velipasalar
Chinmay Hegde
Anuj Sharma
Soumik Sarkar
VLM
176
28
0
16 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen
  Large Language Models
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Hongsheng Li
MLLM
114
28
0
15 Jun 2023
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
Xun Guo
Yihe Deng
Weizhen He
Qihao Chen
Qingsong Xie
...
Feng Zhu
Rui Zhao
Wanli Ouyang
Donglian Qi
Yunfeng Yan
252
32
0
13 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language
  Pre-training
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
148
4
0
12 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
Learning to Ground Instructional Articles in Videos through Narrations
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
121
24
0
06 Jun 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
117
6
0
26 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
  Scale
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
123
10
0
23 May 2023
Learning Emotion Representations from Verbal and Nonverbal Communication
Learning Emotion Representations from Verbal and Nonverbal Communication
Sitao Zhang
Yimu Pan
Jianmin Wang
VLM
165
29
0
22 May 2023
Connecting Multi-modal Contrastive Representations
Connecting Multi-modal Contrastive Representations
Zehan Wang
Yang Zhao
Xize Cheng
Haifeng Huang
Jiageng Liu
...
Lin Li
Yongqiang Wang
Aoxiong Yin
Ziang Zhang
Zhou Zhao
105
31
0
22 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Joey Tianyi Zhou
Heng Ji
KELMVGen
91
34
0
18 May 2023
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
  Zero Shot
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot
Aanisha Bhattacharya
Yaman Kumar Singla
Balaji Krishnamurthy
R. Shah
Changyou Chen
VGen
103
13
0
16 May 2023
Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
  Transformer-based Video Question Answering
Is a Video worth n×nn\times nn×n Images? A Highly Efficient Approach to Transformer-based Video Question Answering
Chenyang Lyu
Tianbo Ji
Yvette Graham
Jennifer Foster
ViT
117
0
0
16 May 2023
An Inverse Scaling Law for CLIP Training
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLMCLIP
165
69
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question
  Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
207
159
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
244
658
0
10 May 2023
Visual Transformation Telling
Visual Transformation Telling
Wanqing Cui
Mustafa Nasir-Moin
Yanyan Lan
Viola J. Chen
Jiafeng Guo
Xueqi Cheng
LRM
164
1
0
03 May 2023
Implicit Temporal Modeling with Learnable Alignment for Video
  Recognition
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
S. Tu
Qi Dai
Zuxuan Wu
Zhi-Qi Cheng
Hang-Rui Hu
Yu-Gang Jiang
137
47
0
20 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
77
16
0
18 Apr 2023
Pretrained Language Models as Visual Planners for Human Assistance
Pretrained Language Models as Visual Planners for Human Assistance
Dhruvesh Patel
H. Eghbalzadeh
Nitin Kamra
Michael L. Iuzzolino
Unnat Jain
Ruta Desai
LM&Ro
125
32
0
17 Apr 2023
Multimodal Representation Learning of Cardiovascular Magnetic Resonance
  Imaging
Multimodal Representation Learning of Cardiovascular Magnetic Resonance Imaging
Jielin Qiu
Peide Huang
Makiya Nakashima
Jae-Hyeok Lee
Jiacheng Zhu
...
Byung-Hak Kim
Debbie Kwon
Douglas Weber
Ding Zhao
David Chen
SSL
89
6
0
16 Apr 2023
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
Jiani Huang
Ziyang Li
Mayur Naik
Ser-Nam Lim
284
5
0
15 Apr 2023
Previous
123456789
Next