1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 802 papers shown
EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang
Xiangyu Zhu
Bo Wang
Quan Chen
Peng Jiang
Zhen Lei
56
0
0
03 Dec 2025
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
JingTian Ma
Jingyuan Wang
Wayne Xin Zhao
Guoping Liu
Xiang Wen
CLIP
48
0
0
12 Nov 2025
Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection
Xian-Hong Huang
Hui-Kai Su
Chi-Chia Sun
Jun-Wei Hsieh
ObjD
388
0
0
07 Nov 2025
MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer
International Conference on Image Processing (ICIP), 2024
Taiga Yamane
Satoshi Suzuki
Ryo Masumura
Shotaro Tora
96
1
0
04 Nov 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
152
0
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
231
0
0
30 Oct 2025
MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
Anisha Saha
Varsha Suresh
Timothy Hospedales
Vera Demberg
LRM
77
0
0
27 Oct 2025
Modest-Align: Data-Efficient Alignment for Vision-Language Models
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Mingkun Xu
Zuozhu Liu
VLM
116
0
0
24 Oct 2025
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Rahul Raja
A. Vats
143
1
0
23 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
426
0
0
16 Oct 2025
Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning
Tanner Muturi
Blessing Agyei Kyem
Joshua Kofi Asamoah
Neema Jakisa Owor
Richard Dyzinela
Andrews Danyo
Y. Adu-Gyamfi
Armstrong Aboah
LRM
121
3
0
13 Oct 2025
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Xinlong Chen
Yue Ding
Weihong Lin
Jingyun Hua
Linli Yao
...
Yuanxing Zhang
Qiang Liu
Pengfei Wan
Liang Wang
Tieniu Tan
241
2
0
12 Oct 2025
Expressive and Scalable Quantum Fusion for Multimodal Learning
T. Nguyen
Trong Nghia Hoang
Phi Le Nguyen
Hai L. Vu
Truong Cong Thang
130
0
0
08 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLM
OffRL
VLM
LRM
686
8
0
06 Oct 2025
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi
Jacopo Staiano
Antonio Liotta
VLM
107
0
0
30 Sep 2025
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao
Bingbing Zhuang
Sparsh Garg
Amit Roy-Chowdhury
Christian Shelton
Manmohan Chandraker
Abhishek Aich
LRM
174
1
0
23 Sep 2025
Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
Zihan Ding
Junlong Chen
Per Ola Kristensson
Junxiao Shen
Xinyi Wang
VGen
208
0
0
20 Sep 2025
MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment
Yanyun Pu
Kehan Li
Zeyi Huang
Zhijie Zhong
Kaixiang Yang
VGen
88
0
0
15 Sep 2025
Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang
Piotr Koniusz
Yongsheng Gao
3DV
VGen
AI4TS
233
0
0
11 Sep 2025
Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes
Xinhao Xiang
Kuan-Chuan Peng
Suhas Lohit
Michael Jeffrey Jones
Jiawei Zhang
3DPC
154
1
0
22 Aug 2025
TrajSV: A Trajectory-based Model for Sports Video Representations and Applications
Zheng Wang
Shihao Xu
Wei Shi
104
0
0
15 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
155
0
0
09 Aug 2025
IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning
Tianheng Qiu
Jingchun Gao
Jingyu Li
Huiyi Leong
Xuan Huang
Xi Wang
Xiaocheng Zhang
K. Xu
Lan Zhang
161
8
0
24 Jul 2025
EVOLVE-X: Embedding Fusion and Language Prompting for User Evolution Forecasting on Social Media
Ismail Hossain
Sai Puppala
Md. Jahangir Alam
Sajedul Talukder
78
0
0
21 Jul 2025
Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Wenhao Li
Xiu Su
Jingyi Wu
Feng Yang
Yang-Yang Liu
Yi-Ling Chen
Shan You
Chang Xu
VLM
207
0
0
07 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter
Ibne Farabi Shihab
Anuj Sharma
VLM
293
2
0
02 Jul 2025
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam
Vincent Tao Hu
Bjorn Ommer
VLM
255
2
0
28 Jun 2025
Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition
Xiaodan Hu
Chuhang Zou
Suchen Wang
Jaechul Kim
Narendra Ahuja
LRM
167
0
0
20 Jun 2025
Vision Generalist Model: A Survey
International Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
281
0
0
11 Jun 2025
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Daeun Lee
Jaehong Yoon
Jaemin Cho
Mohit Bansal
LRM
304
2
0
04 Jun 2025
CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection
Knowledge-Based Systems (KBS), 2025
David Ortiz-Perez
Manuel Benavent-Lledo
Javier Rodriguez-Juan
José García Rodríguez
David Tomás
287
4
0
02 Jun 2025
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models
Kai Tang
Jinhao You
Xiuqi Ge
Hanze Li
Yichen Guo
Xiande Huang
MLLM
463
3
0
18 May 2025
Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence
Yu Qiao
Huy Q. Le
Avi Deb Raha
Phuong-Nam Tran
Apurba Adhikary
Mengchun Zhang
Loc X. Nguyen
Eui-nam Huh
Zhu Han
Choong Seon Hong
AI4CE
379
5
0
11 May 2025
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
Zhijie Qiao
Haowei Li
Zhong Cao
Henry X. Liu
VLM
436
48
0
01 May 2025
HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu
Irfan Essa
829
0
0
25 Apr 2025
Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
Lakshita Agarwal
Bindu Verma
ViT
152
0
0
23 Apr 2025
RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence
Zengyuan Lai
Jiarui Yang
Songpengcheng Xia
Lizhou Lin
Lan Sun
Renwen Wang
Qingbin Liu
Qi Wu
Ling Pei
282
1
0
14 Apr 2025
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
173
11
0
10 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Yue Zhao
Shuang Xu
Bo Xu
VLM
201
3
0
09 Apr 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Information Fusion (Inf. Fusion), 2025
Xiaolun Jing
Genke Yang
Jian Chu
222
3
0
07 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
276
2
0
03 Apr 2025
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Shreyank N. Gowda
Boyan Gao
Xiao Gu
Xiaobo Jin
VLM
343
0
0
02 Apr 2025
PolygoNet: Leveraging Simplified Polygonal Representation for Effective Image Classification
Salim Khazem
Jérémy Fix
Cédric Pradalier
142
3
0
01 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
262
4
0
31 Mar 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
International Symposium on Modeling and Optimization in Mobile, Ad-Hoc and Wireless Networks (WiOpt), 2025
Yubo Zhang
Pedro Botelho
Trevor Gordon
Gil Zussman
I. Kadota
273
1
0
31 Mar 2025
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
351
2
0
28 Mar 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Computer Vision and Pattern Recognition (CVPR), 2025
Nina Shvetsova
Arsha Nagrani
Bernt Schiele
Hilde Kuehne
Christian Rupprecht
250
1
0
24 Mar 2025
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2025
Arun V. Reddy
Alexander Martin
Eugene Yang
Andrew Yates
Kate Sanders
Kenton W. Murray
Reno Kriz
Celso M. De Melo
Benjamin Van Durme
Rama Chellappa
312
9
0
24 Mar 2025
Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture
Cheng Li
Jiexiong Liu
Yixuan Chen
Yanqin Jia
MLLM
VLM
251
2
0
20 Mar 2025
RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment
Chao Wang
Giulio Franzese
A. Finamore
Pietro Michiardi
430
7
0
18 Mar 2025