1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
Xiaohan Yu
Li Zhang
Xin Zhao
Yue Wang
Zhongrui Ma
161
14
0
07 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
254
5
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
AI4CE
554
26
0
30 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
238
0
0
29 Jan 2024
Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks
Yuliang Cai
Mohammad Rostami
357
6
0
27 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Computer Vision and Pattern Recognition (CVPR), 2024
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
312
11
0
25 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
IEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2024
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
235
1
0
22 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
296
42
0
19 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
376
2
0
15 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
VLM
712
163
0
29 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Computer Vision and Pattern Recognition (CVPR), 2023
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
Anthony L. Caterini
460
8
0
15 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
240
66
0
11 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
288
1
1
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
496
2
0
30 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
392
1
0
28 Nov 2023
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Computer Vision and Pattern Recognition (CVPR), 2023
Sicong Leng
Hang Zhang
Guanzheng Chen
Xin Li
Shijian Lu
Chunyan Miao
Li Bing
VLM
MLLM
311
442
0
28 Nov 2023
Vamos: Versatile Action Models for Video Understanding
European Conference on Computer Vision (ECCV), 2023
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
389
35
0
22 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
443
14
0
21 Nov 2023
Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal Learning
Journal of Chemical Information and Modeling (JCIM), 2023
Hao Xu
Yifei Wang
Yunrui Li
Pengyu Hong
371
1
0
11 Nov 2023
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Calvin Luo
Boqing Gong
Ting Chen
Chen Sun
OCL
ObjD
163
1
0
10 Nov 2023
CLearViD: Curriculum Learning for Video Description
Cheng-Yu Chuang
Pooyan Fazli
152
1
0
08 Nov 2023
A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation
Neural Information Processing Systems (NeurIPS), 2023
Qi-jun Zhao
Ce Zheng
Mengyuan Liu
Chong Chen
231
24
0
06 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
282
9
0
02 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
279
29
0
31 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
350
3
0
30 Oct 2023
Generating Context-Aware Natural Answers for Questions in 3D Scenes
British Machine Vision Conference (BMVC), 2023
Mohammed Munzer Dwedari
Matthias Niessner
Dave Zhenyu Chen
194
4
0
30 Oct 2023
MOSEL: Inference Serving Using Dynamic Modality Selection
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Bodun Hu
Le Xu
Jeongyoon Moon
N. Yadwadkar
Aditya Akella
300
5
0
27 Oct 2023
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
Conference on Computational Natural Language Learning (CoNLL), 2023
Mohammad Akbari
Saeed Ranjbar Alvar
Behnam Kamranian
Amin Banitalebi-Dehkordi
Yong Zhang
AI4CE
139
2
0
26 Oct 2023
Exploring Iterative Refinement with Diffusion Models for Video Grounding
IEEE International Conference on Multimedia and Expo (ICME), 2023
Xiao Liang
Tao Shi
Yaoyuan Liang
Te Tao
Shao-Lun Huang
DiffM
267
2
0
26 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
302
14
0
25 Oct 2023
FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing
Anant Khandelwal
459
2
0
24 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yanyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
389
13
0
17 Oct 2023
GePSAn: Generative Procedure Step Anticipation in Cooking Videos
IEEE International Conference on Computer Vision (ICCV), 2023
M. A. Abdelsalam
Samrudhdhi B. Rangrej
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Afsaneh Fazly
AI4TS
217
8
0
12 Oct 2023
Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives
Yuchen Yang
Linyida Zhang
219
2
0
09 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
European Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
297
31
0
07 Oct 2023
CLEVRER-Humans: Describing Physical and Causal Events the Human Way
Neural Information Processing Systems (NeurIPS), 2023
Jiayuan Mao
Xuelin Yang
Xikun Zhang
Noah D. Goodman
Jiajun Wu
NAI
333
22
0
05 Oct 2023
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
271
11
0
02 Oct 2023
Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiheng Li
Wenjia Geng
Muheng Li
Lei Chen
Yansong Tang
Jiwen Lu
Jie Zhou
178
12
0
01 Oct 2023
PROSE: Predicting Operators and Symbolic Expressions using Multimodal Transformers
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
211
22
0
28 Sep 2023
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
180
2
0
28 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
238
5
0
28 Sep 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
253
8
0
27 Sep 2023
VidChapters-7M: Video Chapters at Scale
Neural Information Processing Systems (NeurIPS), 2023
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
246
39
0
25 Sep 2023
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
Mohamed Afham
Satya Narayan Shukla
Omid Poursaeed
Pengchuan Zhang
Ashish Shah
Sernam Lim
VLM
195
4
0
20 Sep 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
ACM Multimedia (ACM MM), 2023
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming-Hsuan Yang
Yuan Qi
366
17
0
20 Sep 2023
Collaborative Three-Stream Transformers for Video Captioning
Computer Vision and Image Understanding (CVIU), 2023
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
193
8
0
18 Sep 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
216
31
0
14 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
239
38
0
04 Sep 2023
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
J. Denize
Mykola Liashuha
Jaonary Rabarisoa
Astrid Orcesi
Romain Hérault
ViT
289
19
0
03 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint Objectives
Chinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
123
0
0
02 Sep 2023