A Straightforward Framework For Video Retrieval Using CLIP

Mexican Conference on Pattern Recognition (MPR), 2021
24 February 2021
Jesús Andrés Portillo-Quintero
J. C. Ortíz-Bayliss
Hugo Terashima-Marín
    CLIP

Papers citing "A Straightforward Framework For Video Retrieval Using CLIP"

50 / 64 papers shown
MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
J. Huang
Yaxiong Chen
Ganchao Liu
154
0
0
17 Oct 2025
VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Yidan Zhang
Mutian Xu
Yiming Hao
Kun Zhou
Jiahao Chang
Xiaoqiang Liu
Pengfei Wan
Hongbo Fu
Xiaoguang Han
VGen
206
1
0
25 Sep 2025
BiListing: Modality Alignment for Listings
Guillaume Guy
Mihajlo Grbovic
Chun How Tan
Han Zhao
217
0
0
28 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML, MDE
227
1
0
09 Aug 2025
Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025
Xueguang Ma
Luyu Gao
Shengyao Zhuang
Jiaqi Samantha Zhan
Jamie Callan
Jimmy Lin
1.0K
18
0
05 May 2025
Detecting Content Rating Violations in Android Applications: A Vision-Language Approach
Dishanika Denipitiyage
B. Silva
Suranga Seneviratne
A. Seneviratne
Sanjay Chawla
256
0
0
07 Feb 2025
Optimized two-stage AI-based Neural Decoding for Enhanced Visual Stimulus Reconstruction from fMRI Data
Journal of Neural Engineering (J. Neural Eng.), 2024
Lorenzo Veronese
Andrea Moglia
Luca Mainardi
Pietro Cerveri
DiffM
355
1
0
17 Dec 2024
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Bingqing Zhang
Zhuo Cao
Heming Du
Xin Yu
Xue Li
Jiajun Liu
Sen Wang
VGen
283
7
0
30 Sep 2024
From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition
Shiwei Wu
Chao Zhang
Joya Chen
Tong Xu
Likang Wu
Yao Hu
Enhong Chen
209
2
0
12 Jun 2024
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Han Fang
Xianghao Zang
Chao Ban
Zerun Feng
Lanxiang Zhou
Zhongjiang He
Yongxiang Li
Hao Sun
403
3
0
18 Apr 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang
Guohao Sun
Pichao Wang
Dongfang Liu
S. Dianat
Majid Rabbani
Raghuveer M. Rao
Zhiqiang Tao
VGen
490
78
0
26 Mar 2024
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
312
4
0
25 Nov 2023
Videoprompter: an ensemble of foundational models for zero-shot video understanding
Adeel Yousaf
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
Mubarak Shah
VLM
279
3
0
23 Oct 2023
Encoding and Decoding Narratives: Datafication and Alternative Access Models for Audiovisual Archives
ACM Multimedia (ACM MM), 2023
Yuchen Yang
215
1
0
10 Oct 2023
Write What You Want: Applying Text-to-video Retrieval to Audiovisual Archives
ACM Journal on Computing and Cultural Heritage (JOCCH), 2023
Yuchen Yang
VGen
229
9
0
09 Oct 2023
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
CLIP, VLM
298
31
0
08 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
European Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
377
34
0
07 Oct 2023
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
International Conference on Learning Representations (ICLR), 2023
Bin Zhu
Bin Lin
Munan Ning
Yang Yan
Jiaxi Cui
...
Zongwei Li
Wancai Zhang
Zhifeng Li
Wei Liu
Liejie Yuan
VLM, MLLM
947
398
0
03 Oct 2023
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Neural Information Processing Systems (NeurIPS), 2023
Hao Li
Marie-Jeanne Lesot
Lianli Gao
Xiaosu Zhu
Christophe Marsala
EDL
339
38
0
29 Sep 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Joey Tianyi Zhou
470
89
0
18 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
269
7
0
16 Sep 2023
Representation Learning for Sequential Volumetric Design Tasks
Md Ferdous Alam
Yi Wang
Linh Tran
Chin-Yi Cheng
Jieliang Luo
3DV
322
3
0
05 Sep 2023
Multi-event Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
255
18
0
22 Aug 2023
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Willy Fitra Hendria
290
4
0
20 Jun 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
327
14
0
23 May 2023
i-Code Studio: A Configurable and Composable Framework for Integrative AI
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yuwei Fang
Mahmoud Khademi
Chenguang Zhu
Ziyi Yang
Reid Pryzant
...
Yao Qian
Takuya Yoshioka
Lu Yuan
Michael Zeng
Xuedong Huang
246
2
0
23 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
ACM Multimedia (ACM MM), 2023
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
308
8
0
13 May 2023
Visual Reasoning: from State to Transformation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Xin Hong
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
244
4
0
02 May 2023
Verbs in Action: Improving verb understanding in video-language models
IEEE International Conference on Computer Vision (ICCV), 2023
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
549
91
0
13 Apr 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
AAAI Conference on Artificial Intelligence (AAAI), 2023
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
248
18
0
12 Mar 2023
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
ACM Transactions on Knowledge Discovery from Data (TKDD), 2023
Yansong Gong
Georgina Cosma
Axel Finke
ViT
363
4
0
13 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yizhen Chen
Jie Wang
Lijian Lin
Chen Ma
Jin Ma
Ying Shan
VLM
300
37
0
30 Jan 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Bo Fang
Wenhao Wu
Chang-rui Liu
Can Ma
Yuxin Song
Weiping Wang
Min Yang
Xiang Ji
Jingdong Wang
380
97
0
16 Jan 2023
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan
Tao Zhu
Zirui Wang
Yuan Cao
Mi Zhang
Soham Ghosh
Yonghui Wu
Jiahui Yu
VLM, VGen
401
78
0
09 Dec 2022
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
Damianos Galanopoulos
Vasileios Mezaris
275
7
0
21 Nov 2022
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
IEEE International Conference on Computer Vision (ICCV), 2022
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
431
39
0
17 Nov 2022
Boosting Video-Text Retrieval with Explicit High-Level Semantics
ACM Multimedia (ACM MM), 2022
Haoran Wang
Di Xu
Dongliang He
Fu Li
Zhong Ji
Jungong Han
Errui Ding
259
16
0
08 Aug 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
ACM Multimedia (ACM MM), 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP, VLM
336
436
0
15 Jul 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL
937
1,699
0
04 May 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
191
49
0
26 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
IEEE Access (IEEE Access), 2022
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
447
32
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
European Conference on Computer Vision (ECCV), 2022
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
456
57
0
06 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
International Conference on Learning Representations (ICLR), 2022
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM, LRM
790
715
0
01 Apr 2022
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuxuan Wang
Difei Gao
Licheng Yu
Stan Weixian Lei
Matt Feiszli
Mike Zheng Shou
684
29
0
01 Apr 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
Anthony L. Caterini
Animesh Garg
Guangwei Yu
395
243
0
28 Mar 2022
Disentangled Representation Learning for Text-Video Retrieval
Qiang Wang
Yanhao Zhang
Yun Zheng
Pan Pan
Xiansheng Hua
263
105
0
14 Mar 2022
MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization
Alexander Kunitsyn
M. Kalashnikov
Maksim Dzabraev
Andrei Ivaniuta
210
18
0
14 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
523
43
0
08 Mar 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Computer Vision and Pattern Recognition (CVPR), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
396
126
0
13 Jan 2022
Multi-Query Video Retrieval
European Conference on Computer Vision (ECCV), 2022
Zeyu Wang
Yu Wu
Karthik Narasimhan
Olga Russakovsky
336
25
0
10 Jan 2022