Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2011.07231
Cited By
ActBERT: Learning Global-Local Video-Text Representations
Computer Vision and Pattern Recognition (CVPR), 2020
14 November 2020
Linchao Zhu
Yi Yang
ViT
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"ActBERT: Learning Global-Local Video-Text Representations"
50 / 278 papers shown
EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang
Xiangyu Zhu
Bo Wang
Quan Chen
Peng Jiang
Zhen Lei
100
0
0
03 Dec 2025
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Mohamed Eltahir
Ali Habibullah
Lama Ayash
Tanveer Hussain
Naeemullah Khan
111
1
0
03 Nov 2025
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao
Bingbing Zhuang
Sparsh Garg
Amit Roy-Chowdhury
Christian Shelton
Manmohan Chandraker
Abhishek Aich
LRM
302
1
0
23 Sep 2025
Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao
Jiaxi Gu
Shicong Wang
Xing Zhang
Hang Xu
Zuxuan Wu
Yu-Gang Jiang
191
0
0
20 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
199
0
0
09 Aug 2025
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam
Vincent Tao Hu
Bjorn Ommer
VLM
372
2
0
28 Jun 2025
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
266
0
0
29 May 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Information Fusion (Inf. Fusion), 2025
Xiaolun Jing
Genke Yang
Jian Chu
255
4
0
07 Apr 2025
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2025
Arun V. Reddy
Alexander Martin
Eugene Yang
Andrew Yates
Kate Sanders
Kenton W. Murray
Reno Kriz
Celso M. De Melo
Benjamin Van Durme
Rama Chellappa
452
12
0
24 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Computer Vision and Pattern Recognition (CVPR), 2025
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
Xinyu Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
329
12
0
05 Mar 2025
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
AAAI Conference on Artificial Intelligence (AAAI), 2024
Muhammet Furkan Ilaslan
Ali Koksal
Kevin Qinghong Lin
Burak Satar
Mike Zheng Shou
Qianli Xu
LM&Ro
346
3
0
16 Dec 2024
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
298
1
0
12 Dec 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
278
1
0
11 Nov 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
International Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
334
4
0
23 Oct 2024
LocoMotion: Learning Motion-Focused Video-Language Representations
Asian Conference on Computer Vision (ACCV), 2024
Hazel Doughty
Fida Mohammad Thoker
Cees G. M. Snoek
416
4
0
15 Oct 2024
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
338
10
0
12 Oct 2024
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
IEEE Transactions on Image Processing (TIP), 2024
Ting Yu
Kunhao Fu
Jian Zhang
Qingming Huang
Jun Yu
269
11
0
12 Oct 2024
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yingqiang Gao
Lukas Fischer
Alexa Lintner
Sarah Ebling
280
8
0
11 Oct 2024
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
ACM Multimedia (MM), 2024
Yili Li
Jing Yu
Keke Gai
Bang Liu
Gang Xiong
Qi Wu
DiffM
VGen
270
6
0
21 Aug 2024
Causal Understanding For Video Question Answering
Bhanu Prakash Reddy Guda
Tanmay Kulkarni
Adithya Sampath
Swarnashree Mysore Sathyendra
CML
337
0
0
23 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
534
11
0
04 Jul 2024
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
Yuchen Yang
Yingxuan Duan
VGen
228
0
0
19 Jun 2024
AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
Xing Zhang
Jiaxi Gu
Haoyu Zhao
Shicong Wang
Hang Xu
Renjing Pei
Songcen Xu
Zuxuan Wu
Yu-Gang Jiang
311
1
0
11 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
639
35
1
09 Jun 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
429
36
0
22 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
298
2
0
12 May 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
417
11
0
26 Apr 2024
Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition
Yu Wang
Sanpin Zhou
Kun Xia
Le Wang
261
3
0
25 Apr 2024
A review of deep learning-based information fusion techniques for multimodal medical image classification
Yi-Hsuan Li
Mostafa EL HABIB DAHO
Pierre-Henri Conze
Rachid Zeghlache
Hugo Le Boité
R. Tadayoni
B. Cochener
M. Lamard
G. Quellec
206
154
0
23 Apr 2024
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
Bozhi Luan
Hao Feng
Hong Chen
Yonghui Wang
Wen-gang Zhou
Houqiang Li
MLLM
327
32
0
15 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
265
5
0
01 Apr 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang
Guohao Sun
Pichao Wang
Dongfang Liu
S. Dianat
Majid Rabbani
Raghuveer M. Rao
Zhiqiang Tao
VGen
470
74
0
26 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
European Conference on Computer Vision (ECCV), 2024
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
353
104
0
22 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
541
13
0
21 Mar 2024
What's in the Flow? Exploiting Temporal Motion Cues for Unsupervised Generic Event Boundary Detection
Sourabh Vasant Gothe
Vibhav Agarwal
Sourav Ghosh
Jayesh Rajkumar Vachhani
Pranay Kashyap
Barath Raj Kandur
171
5
0
15 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
291
5
0
14 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
291
5
0
31 Jan 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Yue Yu
Xiangyuan Ren
Yuan Cheng
Wei Chu
248
6
0
31 Jan 2024
Multi-granularity Correspondence Learning from Long-term Noisy Videos
Yijie Lin
Jie Zhang
Zhenyu Huang
Jia-Wei Liu
Zujie Wen
Xi Peng
416
39
0
30 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
IEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2024
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
338
2
0
22 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
333
48
0
19 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Neural Information Processing Systems (NeurIPS), 2024
Ziyi Bai
Ruiping Wang
Xilin Chen
393
15
0
03 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
880
216
0
29 Dec 2023
Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
Rahul Pratap Singh
Bishmoy Paul
Ali Dabouei
Min Xu
379
1
0
10 Dec 2023
Generating Illustrated Instructions
Sachit Menon
Ishan Misra
Rohit Girdhar
DiffM
332
7
0
07 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
321
1
1
30 Nov 2023
A Survey on Multimodal Large Language Models for Autonomous Driving
Can Cui
Yunsheng Ma
Xu Cao
Wenqian Ye
Yang Zhou
...
Xinrui Yan
Shuqi Mei
Jianguo Cao
Ziran Wang
Chao Zheng
397
479
0
21 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
International Conference on Learning Representations (ICLR), 2023
.Ilker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
300
23
0
13 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
308
10
0
02 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
400
3
0
30 Oct 2023
1
2
3
4
5
6
Next
Page 1 of 6