VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

28 September 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP, VLM

Papers citing "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding"

50 / 439 papers shown
Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang
Piotr Koniusz
Yongsheng Gao
3DV, VGen, AI4TS
48
0
0
11 Sep 2025
Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
0
0
0
06 Sep 2025
Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao
Jiaxi Gu
Shicong Wang
Xing Zhang
Hang Xu
Zuxuan Wu
Yu-Gang Jiang
16
0
0
20 Aug 2025
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Omkar Thawakar
Dmitry Demidov
Ritesh Thawkar
Rao Muhammad Anwer
M. Shah
Fahad Shahbaz Khan
Salman Khan
VGen
12
0
0
19 Aug 2025
MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
Chao Tang
Anxing Xiao
Yuhong Deng
Tianrun Hu
Wenlong Dong
Hanbo Zhang
David Hsu
Hong Zhang
35
0
0
19 Aug 2025
Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark
Lavisha Aggarwal
Vikas Bahirwani
Lin Li
Andrea Colaco
VGen
20
0
0
15 Aug 2025
Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment
Yipeng Zhang
Hongju Yu
Aritra Mandal
Canran Xu
Qunzhi Zhou
Zhe Wu
24
0
0
13 Aug 2025
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Min Yang
Zihan Jia
Zhilin Dai
Sheng Guo
Limin Wang
CLIP, VLM
34
0
0
10 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
34
0
0
09 Aug 2025
MoExDA: Domain Adaptation for Edge-based Action Recognition
Takuya Sugimoto
Ning Ding
Toru Tamaki
48
0
0
05 Aug 2025
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
Zuhao Yang
Yingchen Yu
Yunqing Zhao
Shijian Lu
Song Bai
54
0
0
03 Aug 2025
Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment
Dahun Kim
A. Angelova
VLM
58
0
0
03 Aug 2025
Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Stefan Englmeier
Max A. Büttner
Katharina Winter
Fabian B. Flohr
81
0
0
01 Aug 2025
Punching Bag vs. Punching Person: Motion Transferability in Videos
Raiyaan Abdullah
Jared Claypoole
Michael Cogswell
Ajay Divakaran
Yogesh S Rawat
24
0
0
31 Jul 2025
HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
Zhixiang Wei
Guangting Wang
Xiaoxiao Ma
Ke Mei
H. Chen
Yi-jing Jin
Fengyun Rao
CLIP, MLLM, VLM
41
0
0
30 Jul 2025
Group Relative Augmentation for Data Efficient Action Detection
Deep Patel
Iain Melvin
Zachary Izzo
Martin Renqiang Min
VLM
35
0
0
28 Jul 2025
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun
Xingxun Jiang
Haoyu Chen
Yante Li
Zheng Lian
B. Liu
Yuan Zong
Wenming Zheng
Jukka M. Leppänen
Guoying Zhao
CLIP, VLM
38
0
0
28 Jul 2025
Implicit Counterfactual Learning for Audio-Visual Segmentation
Mingfeng Zha
Tianyu Li
G. Wang
Peng Wang
Yangyang Wu
Yang Yang
Heng Tao Shen
VOS, CML
46
0
0
28 Jul 2025
Principled Multimodal Representation Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
41
1
0
23 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter
Ibne Farabi Shihab
Anuj Sharma
VLM
77
1
0
02 Jul 2025
Bridging Brain with Foundation Models through Self-Supervised Learning
Hamdi Altaheri
Fakhri Karray
Md. Milon Islam
S M Taslim Uddin Raju
Amir-Hossein Karimi
83
0
0
19 Jun 2025
Can Vision Language Models Understand Mimed Actions?
Hyundong Justin Cho
Spencer Lin
Tejas Srinivasan
Michael Saxon
Deuksin Kwon
Natali T. Chavez
Jonathan May
72
1
0
17 Jun 2025
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning
Yifeng Gao
Yifan Ding
Hongyu Su
Juncheng Li
Yunhan Zhao
...
Li Wang
Xin Wang
Yixu Wang
Xingjun Ma
Yu-Gang Jiang
VGen
103
0
0
13 Jun 2025
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li
Yutong Chen
Yiqian Wu
Kaifeng Zhao
Marc Pollefeys
Siyu Tang
EgoV, VLM
152
0
0
09 Jun 2025
Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection
Shanmukha Vellamcheti
Sanjoy Kundu
Sathyanarayanan N. Aakur
107
0
0
06 Jun 2025
Aligning Multimodal Representations through an Information Bottleneck
Antonio Almudévar
José Miguel Hernández-Lobato
Sameer Khurana
R. Marxer
Alfonso Ortega
SSL
183
1
0
05 Jun 2025
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning
Delong Chen
Willy Chung
Yejin Bang
Ziwei Ji
Pascale Fung
VGen, LM&Ro
142
2
0
04 Jun 2025
CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection
David Ortiz-Perez
Manuel Benavent-Lledo
Javier Rodriguez-Juan
José García Rodríguez
David Tomás
118
0
0
02 Jun 2025
Scene Detection Policies and Keyframe Extraction Strategies for Large-Scale Video Analysis
Vasilii Korolkov
63
1
0
31 May 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang
Yan Shu
Zhifei Yang
Yan Zhang
Yu-Hong Li
K. Lu
Gangyan Zeng
Shaohui Liu
Yu Zhou
N. Sebe
CoGe
125
2
0
28 May 2025
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Fuwen Luo
Shengfeng Lou
C. L. Philip Chen
Ziyue Wang
Chenliang Li
...
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Teli Ma
AI4TS, LRM
132
2
0
27 May 2025
Learning Shared Representations from Unpaired Data
Amitai Yacobi
Nir Ben-Ari
Ronen Talmon
Uri Shaham
SSL
122
0
0
23 May 2025
Video-GPT via Next Clip Diffusion
Shaobin Zhuang
Zhipeng Huang
Ying Zhang
Fangyikang Wang
Canmiao Fu
Binxin Yang
Chong Sun
Chen Li
Yali Wang
DiffM, VGen
377
2
0
18 May 2025
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang
Jiaxin Song
Yifeng Gao
Xin Wang
Yang Yao
Yan Teng
Xingjun Ma
Yingchun Wang
Yu-Gang Jiang
219
1
0
17 May 2025
Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare
Amara Tariq
Rimita Lahiri
Charles Kahn
Imon Banerjee
84
0
0
12 May 2025
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo
Patrícia Izar
Irene Delval
Victor de Napole Gregolin
Nina S. T. Hirata
VGen
127
0
0
08 May 2025
T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation
Xuyang Guo
Jiayan Huo
Zhenmei Shi
Zhao Song
Jiahao Zhang
Jiale Zhao
EGVM, VGen, PINN
243
11
0
01 May 2025
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Shin'ya Yamaguchi
Dewei Feng
Sekitoshi Kanai
Kazuki Adachi
Daiki Chijiwa
VLM
118
4
0
17 Apr 2025
AdaVid: Adaptive Video-Language Pretraining
Chaitanya Patel
Juan Carlos Niebles
Ehsan Adeli
VLM
54
0
0
16 Apr 2025
Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation
Amirhossein Dadashzadeh
Parsa Esmati
Majid Mirmehdi
TTA, VLM
161
0
0
15 Apr 2025
CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models
P. Guhan
D. Kothandaraman
Tsung-Wei Huang
Guan-Ming Su
Dinesh Manocha
DiffM, VGen
97
0
0
13 Apr 2025
Pose-Aware Weakly-Supervised Action Segmentation
Seth Z. Zhao
Reza Ghoddoosian
Isht Dwivedi
Nakul Agarwal
Behzad Dariush
160
0
0
08 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
493
0
0
07 Apr 2025
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
Peng Wu
Wanshun Su
Guansong Pang
Yujia Sun
Qingsen Yan
Peng Wang
Yujiao Shi
VLM
144
1
0
06 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
201
1
0
04 Apr 2025
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Shreyank N. Gowda
Boyan Gao
Xiao Gu
Xiaobo Jin
VLM
160
0
0
02 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Jianchao Tan
MGen, VGen
334
1
0
01 Apr 2025
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Fanding Huang
Jingyan Jiang
Qinting Jiang
Hebei Li
Faisal Nadeem Khan
Zhi Wang
VLM
156
1
0
30 Mar 2025
SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu
Juze Zhang
Changan Chen
Tiange Xiang
Yusu Fang
Juan Carlos Niebles
Ehsan Adeli
VGen
148
2
0
28 Mar 2025
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVM, VGen
186
1
0
27 Mar 2025