Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

9 September 2021
Xingyi Cheng
Hezheng Lin
Xiangyu Wu
Fan Yang
Dong Shen
arXiv:2109.04290

Papers citing "Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss"

50 / 103 papers shown
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
408
1
0
14 Jul 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
International Conference on Learning Representations (ICLR), 2023
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM, CLIP
197
11
0
15 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
252
8
0
12 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Neural Information Processing Systems (NeurIPS), 2023
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
506
171
0
29 May 2023
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Susung Hong
Junyoung Seo
Heeseong Shin
Sung‐Jin Hong
Seung Wook Kim
DiffM, VGen
281
53
0
23 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
263
13
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM, CLIP
293
23
0
22 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
ACM Multimedia (ACM MM), 2023
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
240
5
0
13 May 2023
A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension
Pattern Recognition (Pattern Recogn.), 2023
Weijia Wu
Yuzhong Zhao
Zhuangzi Li
Jiahong Li
Hong Zhou
Mike Zheng Shou
Xiang Bai
197
34
0
05 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
382
150
0
17 Apr 2023
DATE: Domain Adaptive Product Seeker for E-commerce
Computer Vision and Pattern Recognition (CVPR), 2023
Haoyuan Li
Haojie Jiang
Tao Jin
Meng-Juan Li
Yan Chen
Zhijie Lin
Yang Zhao
Zhou Zhao
308
6
0
07 Apr 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2023
Peng Jin
Jinfa Huang
Pengfei Xiong
Shangxuan Tian
Chang-rui Liu
Xiang Ji
Li-ming Yuan
Jie Chen
270
78
0
25 Mar 2023
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Computer Vision and Pattern Recognition (CVPR), 2023
Jiahao Zhang
A. Cherian
Yanbin Liu
Yizhak Ben-Shabat
Cristian Rodriguez-Opazo
Stephen Gould
224
11
0
24 Mar 2023
Dialogue-to-Video Retrieval
European Conference on Information Retrieval (ECIR), 2023
Chenyang Lyu
Manh-Duy Nguyen
Van-Tu Ninh
Liting Zhou
C. Gurrin
Jennifer Foster
169
4
0
23 Mar 2023
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
AAAI Conference on Artificial Intelligence (AAAI), 2023
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
261
2
0
10 Mar 2023
Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Yifei Xin
Dongchao Yang
Yuexian Zou
382
32
0
10 Mar 2023
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Computer Vision and Pattern Recognition (CVPR), 2023
Dezhao Luo
Jiabo Huang
S. Gong
Hailin Jin
Yang Liu
VGen
319
41
0
28 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
International Journal of Multimedia Information Retrieval (IJMIR), 2023
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
226
28
0
24 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
383
9
0
20 Feb 2023
Video-Text Retrieval by Supervised Sparse Multi-Grained Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yimu Wang
Peng Shi
229
9
0
19 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yizhen Chen
Jie Wang
Lijian Lin
Chen Ma
Jin Ma
Ying Shan
VLM
245
34
0
30 Jan 2023
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
Computer Vision and Pattern Recognition (CVPR), 2023
Ruyang Liu
Jingjia Huang
Ge Li
Jiashi Feng
Xing Wu
Thomas H. Li
AI4TS, CLIP, VLM
243
74
0
26 Jan 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Bo Fang
Wenhao Wu
Chang-rui Liu
Can Ma
Yuxin Song
Weiping Wang
Min Yang
Xiang Ji
Jingdong Wang
245
82
0
16 Jan 2023
HierVL: Learning Hierarchical Video-Language Embeddings
Computer Vision and Pattern Recognition (CVPR), 2023
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
VLM, AI4TS
434
70
0
05 Jan 2023
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
164
30
0
07 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM, VGen
453
444
0
06 Dec 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
257
17
0
24 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
178
10
0
21 Nov 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Neural Information Processing Systems (NeurIPS), 2022
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David Clifton
Jing Chen
VLM
300
85
0
21 Nov 2022
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
Damianos Galanopoulos
Vasileios Mezaris
167
7
0
21 Nov 2022
Cross-Modal Adapter for Vision-Language Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Gao Huang
350
43
0
17 Nov 2022
Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames
IEEE Transactions on Multimedia (IEEE TMM), 2022
Ning Han
Xun Yang
Ee-Peng Lim
Hao Chen
Qianru Sun
171
6
0
16 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
148
10
0
13 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
227
15
0
10 Oct 2022
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
Xiaohan Zou
Changqiao Wu
Lele Cheng
Zhongyuan Wang
262
7
0
28 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
International Conference on Learning Representations (ICLR), 2022
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP, VLM
428
93
0
14 Sep 2022
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
406
6
0
24 Aug 2022
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Shuo Liu
Weize Quan
Mingyuan Zhou
Sihong Chen
Jian Kang
Zhenlan Zhao
Chen Chen
Dong-Ming Yan
126
3
0
16 Aug 2022
Frozen CLIP Models are Efficient Video Learners
European Conference on Computer Vision (ECCV), 2022
Ziyi Lin
Shijie Geng
Renrui Zhang
Shiyang Feng
Gerard de Melo
Xiaogang Wang
Jifeng Dai
Yu Qiao
Jiaming Song
CLIP, VLM
247
253
0
06 Aug 2022
Don't Stop Learning: Towards Continual Learning for the CLIP Model
Yuxuan Ding
Lingqiao Liu
Chunna Tian
Jingyuan Yang
Haoxuan Ding
CLL, VLM, KELM
221
69
0
19 Jul 2022
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Computer Vision and Pattern Recognition (CVPR), 2022
Jingjia Huang
Yinan Li
Jiashi Feng
Xinglong Wu
Xiaoshuai Sun
Rongrong Ji
VLM
277
55
0
16 Jul 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuqi Liu
Pengfei Xiong
Luhui Xu
Shengming Cao
Qin Jin
257
169
0
16 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
182
14
0
11 Jul 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Computer Vision and Pattern Recognition (CVPR), 2022
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM, VLM
191
93
0
14 Jun 2022
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
CLIP
414
73
0
17 May 2022
Zero-Shot Category-Level Object Pose Estimation
European Conference on Computer Vision (ECCV), 2022
Walter Goodwin
S. Vaze
Ioannis Havoutis
Ingmar Posner
ViT
301
64
0
07 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
IEEE Access (IEEE Access), 2022
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
331
30
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
European Conference on Computer Vision (ECCV), 2022
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
386
53
0
06 Apr 2022
Learning Audio-Video Modalities from Image Captions
European Conference on Computer Vision (ECCV), 2022
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
206
95
0
01 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
International Conference on Learning Representations (ICLR), 2022
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM, LRM
555
681
0
01 Apr 2022