2109.14084
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
28 September 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
Github (31473★)
Papers citing
"VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding"
39 / 439 papers shown
Title
Triangular Contrastive Learning on Molecular Graphs
MinGyu Choi
Wonseok Shin
Yijingxiu Lu
Sun Kim
74
2
0
26 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM
MLLM
137
243
0
24 May 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang
Pengfei Yu
Ruochen Xu
Luowei Zhou
Jie Lei
...
Chenguang Zhu
Derek Hoiem
Shih-Fu Chang
Joey Tianyi Zhou
Heng Ji
MLLM
VLM
318
154
0
22 May 2022
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
CLIP
242
65
0
17 May 2022
Multimodal Conversational AI: A Survey of Datasets and Approaches
Anirudh S. Sundar
Larry Heck
114
32
0
13 May 2022
Language Models Can See: Plugging Visual Controls in Text Generation
Yixuan Su
Tian Lan
Yahui Liu
Fangyu Liu
Dani Yogatama
Yan Wang
Lingpeng Kong
Nigel Collier
VLM
MLLM
151
104
0
05 May 2022
P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision
Henghui Zhao
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Richard P. Wildes
Allan D. Jepson
118
48
0
04 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
DongDong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
127
50
0
03 May 2022
Retrieval-Enhanced Machine Learning
Hamed Zamani
Fernando Diaz
Mostafa Dehghani
Donald Metzler
Michael Bendersky
88
55
0
02 May 2022
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
Shuai Zhao
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
CLIP
106
134
0
02 May 2022
Where in the World is this Image? Transformer-based Geo-localization in the Wild
Shraman Pramanick
E. Nowara
Joshua Gleason
Carlos D. Castillo
Rama Chellappa
ViT
106
47
0
29 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
89
44
0
26 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Guohao Li
M. Donoser
Loris Bazzani
126
27
0
26 Apr 2022
Modality-Balanced Embedding for Video Retrieval
Xun Wang
Bingqing Ke
Xuanping Li
Fangyu Liu
Mingyu Zhang
Xiao Liang
Qi-En Xiao
Cheng Luo
Yue Yu
76
11
0
18 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
171
23
0
07 Apr 2022
Temporal Alignment Networks for Long-term Video
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
115
96
0
06 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
194
45
0
06 Apr 2022
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
Ziyue Wu
Junyu Gao
Shucheng Huang
Changsheng Xu
124
4
0
04 Apr 2022
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
122
92
0
01 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM
LRM
381
618
0
01 Apr 2022
mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling
Seong-Hwan Heo
WonKee Lee
Jong-Hyeok Lee
98
4
0
24 Mar 2022
R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair
Aravind Rajeswaran
Vikash Kumar
Chelsea Finn
Abhi Gupta
LM&Ro
199
652
0
23 Mar 2022
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Guanyu Cai
Yixiao Ge
Binjie Zhang
Alex Jinpeng Wang
Rui Yan
...
Ying Shan
Lianghua He
Xiaohu Qie
Jianping Wu
Mike Zheng Shou
VLM
102
6
0
15 Mar 2022
Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
Yidan Sun
Qin Chao
Yangfeng Ji
Boyang Albert Li
VGen
165
11
0
11 Mar 2022
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
Weixin Liang
Yuhui Zhang
Yongchan Kwon
Serena Yeung
James Zou
VLM
212
487
0
03 Mar 2022
VScript: Controllable Script Generation with Visual Presentation
Ziwei Ji
Yan Xu
I-Tsun Cheng
Samuel Cahyawijaya
Rita Frieske
Etsuko Ishii
Mini Zeng
Andrea Madotto
Pascale Fung
127
4
0
01 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
717
4,874
0
28 Jan 2022
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
146
91
0
26 Jan 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
157
44
0
20 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
145
113
0
13 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
240
225
0
07 Jan 2022
Progressive Video Summarization via Multimodal Self-supervised Learning
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
AI4TS
106
24
0
07 Jan 2022
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
Guosheng Lin
151
198
0
17 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
148
25
0
02 Dec 2021
A Simple Long-Tailed Recognition Baseline via Vision-Language Model
Teli Ma
Shijie Geng
Mengmeng Wang
Jing Shao
Jiasen Lu
Hongsheng Li
Shiyang Feng
Yu Qiao
VLM
150
55
0
29 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
227
232
0
24 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
285
952
0
22 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
106
216
0
19 Nov 2021
Transcript to Video: Efficient Clip Sequencing from Texts
Yu Xiong
Fabian Caba Heilbron
Dahua Lin
CLIP
87
11
0
25 Jul 2021