2109.14084
Cited By
v1
v2 (latest)
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
28 September 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
Github (31473★)
Papers citing
"VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding"
50 / 439 papers shown
Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang
Piotr Koniusz
Yongsheng Gao
3DV
VGen
AI4TS
48
0
0
11 Sep 2025
Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
0
0
0
06 Sep 2025
Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao
Jiaxi Gu
Shicong Wang
Xing Zhang
Hang Xu
Zuxuan Wu
Yu-Gang Jiang
16
0
0
20 Aug 2025
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Omkar Thawakar
Dmitry Demidov
Ritesh Thawkar
Rao Muhammad Anwer
M. Shah
Fahad Shahbaz Khan
Salman Khan
VGen
12
0
0
19 Aug 2025
MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
Chao Tang
Anxing Xiao
Yuhong Deng
Tianrun Hu
Wenlong Dong
Hanbo Zhang
David Hsu
Hong Zhang
35
0
0
19 Aug 2025
Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
Lavisha Aggarwal
Vikas Bahirwani
Lin Li
Andrea Colaco
VGen
20
0
0
15 Aug 2025
Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment
Yipeng Zhang
Hongju Yu
Aritra Mandal
Canran Xu
Qunzhi Zhou
Zhe Wu
24
0
0
13 Aug 2025
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Min Yang
Zihan Jia
Zhilin Dai
Sheng Guo
Limin Wang
CLIP
VLM
34
0
0
10 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
34
0
0
09 Aug 2025
MoExDA: Domain Adaptation for Edge-based Action Recognition
Takuya Sugimoto
Ning Ding
Toru Tamaki
48
0
0
05 Aug 2025
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
Zuhao Yang
Yingchen Yu
Yunqing Zhao
Shijian Lu
Song Bai
54
0
0
03 Aug 2025
Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment
Dahun Kim
A. Angelova
VLM
58
0
0
03 Aug 2025
Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Stefan Englmeier
Max A. Büttner
Katharina Winter
Fabian B. Flohr
81
0
0
01 Aug 2025
Punching Bag vs. Punching Person: Motion Transferability in Videos
Raiyaan Abdullah
Jared Claypoole
Michael Cogswell
Ajay Divakaran
Yogesh S Rawat
24
0
0
31 Jul 2025
HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
Zhixiang Wei
Guangting Wang
Xiaoxiao Ma
Ke Mei
H. Chen
Yi-jing Jin
Fengyun Rao
CLIP
MLLM
VLM
41
0
0
30 Jul 2025
Group Relative Augmentation for Data Efficient Action Detection
Deep Patel
Iain Melvin
Zachary Izzo
Martin Renqiang Min
VLM
35
0
0
28 Jul 2025
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun
Xingxun Jiang
Haoyu Chen
Yante Li
Zheng Lian
B. Liu
Yuan Zong
Wenming Zheng
Jukka M. Leppänen
Guoying Zhao
CLIP
VLM
38
0
0
28 Jul 2025
Implicit Counterfactual Learning for Audio-Visual Segmentation
Mingfeng Zha
Tianyu Li
G. Wang
Peng Wang
Yangyang Wu
Yang Yang
Heng Tao Shen
VOS
CML
46
0
0
28 Jul 2025
Principled Multimodal Representation Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
41
1
0
23 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter
Ibne Farabi Shihab
Anuj Sharma
VLM
77
1
0
02 Jul 2025
Bridging Brain with Foundation Models through Self-Supervised Learning
Hamdi Altaheri
Fakhri Karray
Md. Milon Islam
S M Taslim Uddin Raju
Amir-Hossein Karimi
83
0
0
19 Jun 2025
Can Vision Language Models Understand Mimed Actions?
Hyundong Justin Cho
Spencer Lin
Tejas Srinivasan
Michael Saxon
Deuksin Kwon
Natali T. Chavez
Jonathan May
72
1
0
17 Jun 2025
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning
Yifeng Gao
Yifan Ding
Hongyu Su
Juncheng Li
Yunhan Zhao
...
Li Wang
Xin Wang
Yixu Wang
Xingjun Ma
Yu-Gang Jiang
VGen
103
0
0
13 Jun 2025
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
Yutong Chen
Yiqian Wu
Kaifeng Zhao
Marc Pollefeys
Siyu Tang
EgoV
VLM
152
0
0
09 Jun 2025
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
Shanmukha Vellamcheti
Sanjoy Kundu
Sathyanarayanan N. Aakur
107
0
0
06 Jun 2025
Aligning Multimodal Representations through an Information Bottleneck
Antonio Almudévar
José Miguel Hernández-Lobato
Sameer Khurana
R. Marxer
Alfonso Ortega
SSL
183
1
0
05 Jun 2025
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning
Delong Chen
Willy Chung
Yejin Bang
Ziwei Ji
Pascale Fung
VGen
LM&Ro
142
2
0
04 Jun 2025
CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection
David Ortiz-Perez
Manuel Benavent-Lledo
Javier Rodriguez-Juan
José García Rodríguez
David Tomás
118
0
0
02 Jun 2025
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis
Vasilii Korolkov
63
1
0
31 May 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang
Yan Shu
Zhifei Yang
Yan Zhang
Yu-Hong Li
K. Lu
Gangyan Zeng
Shaohui Liu
Yu Zhou
N. Sebe
CoGe
125
2
0
28 May 2025
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Fuwen Luo
Shengfeng Lou
C. L. Philip Chen
Ziyue Wang
Chenliang Li
...
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Teli Ma
AI4TS
LRM
132
2
0
27 May 2025
Learning Shared Representations from Unpaired Data
Amitai Yacobi
Nir Ben-Ari
Ronen Talmon
Uri Shaham
SSL
122
0
0
23 May 2025
Video-GPT via Next Clip Diffusion
Shaobin Zhuang
Zhipeng Huang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Binxin Yang
Chong Sun
Chen Li
Yali Wang
DiffM
VGen
377
2
0
18 May 2025
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang
Jiaxin Song
Yifeng Gao
Xin Wang
Yang Yao
Yan Teng
Xingjun Ma
Yingchun Wang
Yu-Gang Jiang
219
1
0
17 May 2025
Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare
Amara Tariq
Rimita Lahiri
Charles Kahn
Imon Banerjee
84
0
0
12 May 2025
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo
Patrícia Izar
Irene Delval
Victor de Napole Gregolin
Nina S. T. Hirata
VGen
127
0
0
08 May 2025
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Xuyang Guo
Jiayan Huo
Zhenmei Shi
Zhao Song
Jiahao Zhang
Jiale Zhao
EGVM
VGen
PINN
243
11
0
01 May 2025
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Shin'ya Yamaguchi
Dewei Feng
Sekitoshi Kanai
Kazuki Adachi
Daiki Chijiwa
VLM
118
4
0
17 Apr 2025
AdaVid: Adaptive Video-Language Pretraining
Chaitanya Patel
Juan Carlos Niebles
Ehsan Adeli
VLM
54
0
0
16 Apr 2025
Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Amirhossein Dadashzadeh
Parsa Esmati
Majid Mirmehdi
TTA
VLM
161
0
0
15 Apr 2025
CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models
P. Guhan
D. Kothandaraman
Tsung-Wei Huang
Guan-Ming Su
Dinesh Manocha
DiffM
VGen
97
0
0
13 Apr 2025
Pose-Aware Weakly-Supervised Action Segmentation
Seth Z. Zhao
Reza Ghoddoosian
Isht Dwivedi
Nakul Agarwal
Behzad Dariush
160
0
0
08 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
493
0
0
07 Apr 2025
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
Peng Wu
Wanshun Su
Guansong Pang
Yujia Sun
Qingsen Yan
Peng Wang
Yujiao Shi
VLM
144
1
0
06 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
201
1
0
04 Apr 2025
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Shreyank N. Gowda
Boyan Gao
Xiao Gu
Xiaobo Jin
VLM
160
0
0
02 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Jianchao Tan
MGen
VGen
334
1
0
01 Apr 2025
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang
Jingyan Jiang
Qinting Jiang
Hebei Li
Faisal Nadeem Khan
Zhi Wang
VLM
156
1
0
30 Mar 2025
SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu
Juze Zhang
Changan Chen
Tiange Xiang
Yusu Fang
Juan Carlos Niebles
Ehsan Adeli
VGen
148
2
0
28 Mar 2025
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVM
VGen
186
1
0
27 Mar 2025