Bridging Video-text Retrieval with Multiple Choice Questions

Computer Vision and Pattern Recognition (CVPR), 2022
13 January 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
    BDL

Papers citing "Bridging Video-text Retrieval with Multiple Choice Questions"

50 / 73 papers shown
Table Comprehension in Building Codes using Vision Language Models and Domain-Specific Fine-Tuning
Mohammad Aqib
Mohd Hamza
Ying Hei Chui
Qipei Mei
LMTD
445
0
0
23 Nov 2025
MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
J. Huang
Yaxiong Chen
Ganchao Liu
152
0
0
17 Oct 2025
TCMA: Text-Conditioned Multi-granularity Alignment for Drone Cross-Modal Text-Video Retrieval
Zixu Zhao
Yang Zhan
VGen, AI4TS
158
1
0
11 Oct 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
209
0
0
09 Aug 2025
GAIS: Frame-Level Gated Audio-Visual Integration with Semantic Variance-Scaled Perturbation for Text-Video Retrieval
Bowen Yang
Yun Cao
Chen He
Xiaosu Su
VGen
218
0
0
03 Aug 2025
MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval
Xinru Ying
Jiaqi Mo
Jingyang Lin
Canghong Jin
Fangfang Wang
Lina Wei
244
0
0
04 Jun 2025
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
609
1
0
02 Apr 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
1.1K
15
0
11 Mar 2025
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
322
1
0
12 Dec 2024
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis
N. V. R. Chappa
P. Dobbs
Bhiksha Raj
Khoa Luu
395
3
0
25 Oct 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
310
73
0
27 Jun 2024
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
Khoa T. Vo
Thinh Phan
Kashu Yamazaki
Minh-Triet Tran
Ngan Le
367
1
0
01 Jun 2024
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Omkar Thawakar
Muzammal Naseer
Rao Muhammad Anwer
Salman Khan
Michael Felsberg
Mubarak Shah
Fahad Shahbaz Khan
307
24
0
25 Mar 2024
VidLA: Video-Language Alignment at Scale
Computer Vision and Pattern Recognition (CVPR), 2024
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM, AI4TS
268
8
0
21 Mar 2024
VideoMamba: State Space Model for Efficient Video Understanding
European Conference on Computer Vision (ECCV), 2024
Kunchang Li
Xinhao Li
Yi Wang
Yinan He
Yali Wang
Limin Wang
Yu Qiao
Mamba
346
459
0
11 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
482
378
0
29 Feb 2024
BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yuanyuan Mao
Xin Lin
Qin Ni
Liang He
311
6
0
12 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
291
5
0
31 Jan 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Yue Yu
Xiangyuan Ren
Yuan Cheng
Wei Chu
257
6
0
31 Jan 2024
Multi-granularity Correspondence Learning from Long-term Noisy Videos
Yijie Lin
Jie Zhang
Zhenyu Huang
Jia-Wei Liu
Zujie Wen
Xi Peng
418
39
0
30 Jan 2024
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Kaibin Tian
Yanhua Cheng
Yi Liu
Xinglin Hou
Quan Chen
Han Li
183
18
0
01 Jan 2024
A Language-based solution to enable Metaverse Retrieval
Ali Abdari
Alex Falcon
Giuseppe Serra
DiffM
359
9
0
22 Dec 2023
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Huy Le
Tung Kieu
Anh Nguyen
Ngan Le
VGen
351
7
0
15 Dec 2023
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
Tanveer Hannan
Md. Mohaiminul Islam
Thomas Seidl
Gedas Bertasius
564
12
0
11 Dec 2023
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
383
1
0
10 Dec 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
307
4
0
25 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
International Conference on Learning Representations (ICLR), 2023
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
304
25
0
13 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP, VLM, VGen
400
3
0
30 Oct 2023
Video-adverb retrieval with compositional adverb-action embeddings
British Machine Vision Conference (BMVC), 2023
Thomas Hummel
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
230
1
0
26 Sep 2023
VidChapters-7M: Video Chapters at Scale
Neural Information Processing Systems (NeurIPS), 2023
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
276
44
0
25 Sep 2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
IEEE Transactions on Image Processing (IEEE TIP), 2023
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
243
15
0
11 Sep 2023
Beyond One-to-One: Rethinking the Referring Image Segmentation
IEEE International Conference on Computer Vision (ICCV), 2023
Yutao Hu
Qixiong Wang
Wenqi Shao
Enze Xie
Zhenguo Li
Jungong Han
Ping Luo
3DV
356
84
0
26 Aug 2023
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Chaorui Deng
Qi Chen
Pengda Qin
Dave Zhenyu Chen
Qi Wu
VLM, CLIP
293
48
0
15 Aug 2023
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
IEEE International Conference on Computer Vision (ICCV), 2023
Sarah Ibrahimi
Xiaohang Sun
Pichao Wang
Amanmeet Garg
Ashutosh Sanan
Mohamed Omar
342
37
0
24 Jul 2023
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model
IEEE Transactions on Image Processing (IEEE TIP), 2023
Peng Wu
Jing Liu
Xiangteng He
Yuxin Peng
Peng Wang
Yanning Zhang
471
53
0
24 Jul 2023
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
Neural Information Processing Systems (NeurIPS), 2023
Kate Sanders
David Etter
Reno Kriz
Benjamin Van Durme
VGen
352
8
0
06 Jul 2023
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
Avinash Madasu
Vasudev Lal
CoGe
350
5
0
28 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
International Conference on Learning Representations (ICLR), 2023
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM, CLIP
243
11
0
15 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
289
13
0
12 Jun 2023
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
Haiyang Xu
Qinghao Ye
Xuan-Wei Wu
Mingshi Yan
Yuan Miao
...
Qingfang Qian
Maofei Que
Ji Zhang
Xiaoyan Zeng
Feiyan Huang
VLM, MLLM
199
34
0
07 Jun 2023
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Qiangchang Wang
Yilong Yin
350
1
0
02 Jun 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
324
14
0
23 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
508
173
0
17 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
325
47
0
09 Apr 2023
Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains
Computer Vision and Pattern Recognition (CVPR), 2023
Jie Yang
Chaoqun Wang
Zhen Li
Junle Wang
Ruimao Zhang
193
20
0
09 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
339
63
0
31 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
International Conference on Learning Representations (ICLR), 2023
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
350
0
0
28 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
317
15
0
11 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2023
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS, VLM
584
358
0
27 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
460
9
0
20 Feb 2023