1904.01766
Cited By
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
International Conference on Learning Representations (ICLR), 2023
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
342
10
0
01 Sep 2023
IndGIC: Supervised Action Recognition under Low Illumination
Jing-Teng Zeng
186
3
0
29 Aug 2023
A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER
International Conference on Information and Knowledge Management (CIKM), 2023
Guanting Dong
Zechen Wang
Jinxu Zhao
Gang Zhao
Daichi Guo
...
Keqing He
Xuefeng Li
Liwen Wang
Xinyue Cui
Weiran Xu
216
23
0
28 Aug 2023
Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Jiawen Xie
Pengyu Cheng
Xiao Liang
Yong Dai
Nan Du
290
15
0
25 Aug 2023
Multi-event Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
193
18
0
22 Aug 2023
MusicJam: Visualizing Music Insights via Generated Narrative Illustrations
Communications in Information and Systems (CIS), 2023
Chuer Chen
Nan Cao
Jiani Hou
Yi Guo
Yulei Zhang
Yang Shi
DiffM
200
1
0
22 Aug 2023
Simple Baselines for Interactive Video Retrieval with Questions and Answers
IEEE International Conference on Computer Vision (ICCV), 2023
Kaiqu Liang
Samuel Albanie
200
8
0
21 Aug 2023
Long-range Multimodal Pretraining for Movie Understanding
IEEE International Conference on Computer Vision (ICCV), 2023
Dawit Mureja Argaw
Joon-Young Lee
Markus Woodson
In So Kweon
Fabian Caba Heilbron
VLM
189
14
0
18 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
IEEE International Conference on Computer Vision (ICCV), 2023
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
209
27
0
18 Aug 2023
Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey
International Journal of Computer Vision (IJCV), 2023
Xin Li
Yulin Ren
Xin Jin
Cuiling Lan
Xingyu Wang
Wenjun Zeng
Xinchao Wang
Zhibo Chen
369
139
0
18 Aug 2023
BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction
Knowledge Discovery and Data Mining (KDD), 2023
Dong Wang
Kave Salamatian
Yunqing Xia
Weiwei Deng
Qi Zhang
151
22
0
17 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
IEEE International Conference on Computer Vision (ICCV), 2023
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
293
27
0
16 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
IEEE Transactions on Multimedia (IEEE TMM), 2023
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
187
26
0
15 Aug 2023
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
IEEE International Conference on Computer Vision (ICCV), 2023
Xuehan Bai
Yan Li
Yong Cheng
Wenjie Yang
Quanming Chen
Han Li
169
7
0
10 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Lei Li
Gaoang Wang
VLM
MLLM
620
453
0
31 Jul 2023
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
International Conference on Learning Representations (ICLR), 2023
Qi Zhao
Shijie Wang
Ce Zhang
Changcheng Fu
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
LM&Ro
388
81
0
31 Jul 2023
FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning
Neural Networks (Neural Netw.), 2023
Huy Q. Le
Minh N. H. Nguyen
Chu Myaet Thwal
Yu Qiao
Chao Zhang
Choong Seon Hong
162
26
0
25 Jul 2023
Does Visual Pretraining Help End-to-End Reasoning?
Neural Information Processing Systems (NeurIPS), 2023
Chen Sun
Calvin Luo
Xingyi Zhou
Anurag Arnab
Cordelia Schmid
OCL
LRM
ViT
322
4
0
17 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
International Conference on Learning Representations (ICLR), 2023
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
364
405
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
IEEE International Conference on Computer Vision (ICCV), 2023
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
343
133
0
11 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Pacific Symposium on Biocomputing (PSB), 2023
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
330
7
0
11 Jul 2023
An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code
International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023
Max Hort
Anastasiia Grishina
Leon Moonen
245
8
0
05 Jul 2023
S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
Ye Xue
Diego Klabjan
J. Utke
94
0
0
01 Jul 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
International Conference on Learning Representations (ICLR), 2023
Fuxiao Liu
Kevin Qinghong Lin
Linjie Li
Jianfeng Wang
Yaser Yacoob
Lijuan Wang
VLM
MLLM
427
404
0
26 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
European Conference on Computer Vision (ECCV), 2023
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
103
6
0
25 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
168
6
0
21 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Jiaming Song
MLLM
177
35
0
15 Jun 2023
Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations
ACM Conference on Recommender Systems (RecSys), 2023
Anima Singh
Trung Vu
Nikhil Mehta
Raghunandan H. Keshavan
M. Sathiamoorthy
...
Lukasz Heldt
Li Wei
Devansh Tandon
Ed H. Chi
Xinyang Yi
237
56
0
13 Jun 2023
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation
Jeremy Gwinnup
Kevin Duh
VLM
148
7
0
12 Jun 2023
CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 2023
Wenhang Ge
Xubing Yang
Li Zhang
184
23
0
12 Jun 2023
Optimizing ViViT Training: Time and Memory Reduction for Action Recognition
Shreyank N. Gowda
Anurag Arnab
Jonathan Huang
ViT
182
4
0
07 Jun 2023
Object Detection with Transformers: A Review
Italian National Conference on Sensors (INS), 2023
Tahira Shehzadi
K. Hashmi
D. Stricker
Muhammad Zeshan Afzal
ViT
MU
418
53
0
07 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
IEEE International Conference on Computer Vision (ICCV), 2023
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
217
27
0
06 Jun 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
237
7
0
26 May 2023
Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Shao-Yu Wu
Damai Dai
Ziwei Qin
Tianyu Liu
Binghuai Lin
Yunbo Cao
Zhifang Sui
306
17
0
24 May 2023
Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2023
Pin-Er Chen
Po-Ya Angela Wang
Hsin-Yu Chou
Yu-Hsiang Tseng
S. Hsieh
91
1
0
24 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
293
23
0
22 May 2023
How does Contrastive Learning Organize Images?
Yunzhe Zhang
Yao Lu
Qi Xuan
SSL
163
2
0
17 May 2023
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot
Aanisha Bhattacharyya
Yaman Kumar Singla
Balaji Krishnamurthy
R. Shah
Changyou Chen
VGen
314
14
0
16 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
395
199
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
378
788
0
10 May 2023
SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Houqiang Li
ViT
252
118
0
08 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Anuj Kumar
VGen
150
4
0
04 May 2023
In-Context Learning Unlocked for Diffusion Models
Neural Information Processing Systems (NeurIPS), 2023
Zhendong Wang
Lezhi Li
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zinan Lin
Mingyuan Zhou
VLM
DiffM
333
96
0
01 May 2023
Early Detection of Alzheimer's Disease using Bottleneck Transformers
International Journal of Intelligent Information Technologies (IJIIT), 2022
Arunima Jaiswal
Ananya Sadana
MedIm
140
5
0
01 May 2023
Multimodal Graph Transformer for Multimodal Question Answering
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Xuehai He
Xin Eric Wang
317
10
0
30 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Computer Vision and Pattern Recognition (CVPR), 2023
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
139
18
0
18 Apr 2023
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
International Conference on Learning Representations (ICLR), 2023
Jiani Huang
Ziyang Li
Mayur Naik
Ser-Nam Lim
667
9
0
15 Apr 2023
How you feelin'? Learning Emotions and Mental States in Movie Scenes
Computer Vision and Pattern Recognition (CVPR), 2023
D. Srivastava
A. Singh
Makarand Tapaswi
226
11
0
12 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
199
1
0
10 Apr 2023