1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
Dynamic Neural Networks: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Yizeng Han
Gao Huang
Shiji Song
Le Yang
Honghui Wang
Yulin Wang
3DH
AI4TS
AI4CE
425
802
0
09 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
International Conference on Machine Learning (ICML), 2021
Jaemin Cho
Jie Lei
Hao Tan
Joey Tianyi Zhou
MLLM
598
609
0
04 Feb 2021
Environment Predictive Coding for Embodied Agents
Santhosh Kumar Ramakrishnan
Tushar Nagarajan
Ziad Al-Halah
Kristen Grauman
195
14
0
03 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
234
126
0
31 Jan 2021
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Computer Vision and Pattern Recognition (CVPR), 2021
Xudong Lin
Gedas Bertasius
Jue Wang
Shih-Fu Chang
Devi Parikh
Lorenzo Torresani
VGen
248
74
0
28 Jan 2021
Bottleneck Transformers for Visual Recognition
Computer Vision and Pattern Recognition (CVPR), 2021
A. Srinivas
Nayeon Lee
Niki Parmar
Jonathon Shlens
Pieter Abbeel
Ashish Vaswani
SLR
681
1,124
0
27 Jan 2021
AI Choreographer: Music Conditioned 3D Dance Generation with AIST++
IEEE International Conference on Computer Vision (ICCV), 2021
Ruilong Li
Sha Yang
David A. Ross
Angjoo Kanazawa
ViT
739
637
0
21 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
213
31
0
15 Jan 2021
Learning Temporal Dynamics from Cycles in Narrated Video
IEEE International Conference on Computer Vision (ICCV), 2021
Dave Epstein
Jiajun Wu
Cordelia Schmid
Chen Sun
AI4TS
252
15
0
07 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
923
3,152
0
04 Jan 2021
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
182
0
0
30 Dec 2020
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViT
MLLM
840
610
0
29 Dec 2020
Training data-efficient image transformers & distillation through attention
International Conference on Machine Learning (ICML), 2020
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Edouard Grave
ViT
647
8,277
0
23 Dec 2020
A Survey on Visual Transformer
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Kai Han
Yunhe Wang
Hanting Chen
Xinghao Chen
Jianyuan Guo
...
Chunjing Xu
Yixing Xu
Zhaohui Yang
Yiman Zhang
Dacheng Tao
ViT
1.0K
3,095
0
23 Dec 2020
Human Action Recognition from Various Data Modalities: A Review
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Zehua Sun
Qiuhong Ke
Hossein Rahmani
Mohammed Bennamoun
Gang Wang
Jun Liu
MU
582
699
0
22 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
263
50
0
15 Dec 2020
Attention over learned object embeddings enables complex visual reasoning
Neural Information Processing Systems (NeurIPS), 2020
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
366
78
0
15 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Knowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
256
42
0
13 Dec 2020
A Comprehensive Study of Deep Video Action Recognition
Yi Zhu
Xinyu Li
Chunhui Liu
Mohammadreza Zolfaghari
Yuanjun Xiong
Chongruo Wu
Zhi-Li Zhang
Joseph Tighe
R. Manmatha
Mu Li
VLM
AI4TS
283
210
0
11 Dec 2020
Look Before you Speak: Visually Contextualized Utterances
Computer Vision and Pattern Recognition (CVPR), 2020
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
311
71
0
10 Dec 2020
Hateful Memes Detection via Complementary Visual and Linguistic Networks
W. Zhang
Guihua Liu
Zhuohua Li
Fuqing Zhu
104
21
0
09 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
272
89
0
08 Dec 2020
Deep Learning and the Global Workspace Theory
Trends in Neurosciences (TINS), 2020
R. V. Rullen
Ryota Kanai
202
78
0
04 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
149
16
0
02 Dec 2020
Pose-based Sign Language Recognition using GCN and BERT
Anirudh Tunga
Sai Vidyaranya Nuthalapati
J. Wachs
SLR
200
93
0
01 Dec 2020
Task Programming: Learning Data Efficient Behavior Representations
Computer Vision and Pattern Recognition (CVPR), 2020
Jennifer J. Sun
Ann Kennedy
Eric Zhan
David J. Anderson
Yisong Yue
Pietro Perona
266
63
0
27 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
Computer Vision and Pattern Recognition (CVPR), 2020
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
326
382
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
307
73
0
25 Nov 2020
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari
Hamid Palangi
Jianwei Yang
Sudha Rao
Asli Celikyilmaz
Roland Fernandez
P. Smolensky
Jianfeng Gao
Shih-Fu Chang
205
3
0
18 Nov 2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Bowen Zhang
Hexiang Hu
Joonseok Lee
Mingde Zhao
Sheide Chammas
Vihan Jain
Eugene Ie
Fei Sha
203
39
0
18 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Jianan Wang
Boyang Albert Li
Xiangyu Fan
Jing-Hua Lin
Yanwei Fu
146
3
0
15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Computer Vision and Pattern Recognition (CVPR), 2020
Linchao Zhu
Yi Yang
ViT
324
451
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
181
101
0
10 Nov 2020
Tabular Transformers for Modeling Multivariate Time Series
Inkit Padhi
Yair Schiff
Igor Melnyk
Mattia Rigotti
Youssef Mroueh
Pierre Dognin
Jerret Ross
Ravi Nair
Erik Altman
LMTD
AI4TS
287
114
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Neural Information Processing Systems (NeurIPS), 2020
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
204
178
0
01 Nov 2020
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning
L. Tao
Xueting Wang
T. Yamasaki
VLM
SSL
250
14
0
29 Oct 2020
A Visuospatial Dataset for Naturalistic Verb Learning
Dylan Ebert
Ellie Pavlick
113
7
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Björn Bebensee
Byoung-Tak Zhang
136
7
0
27 Oct 2020
Multilingual Speech Translation with Efficient Finetuning of Pretrained Models
Xian Li
Changhan Wang
Yun Tang
C. Tran
Yuqing Tang
J. Pino
Alexei Baevski
Alexis Conneau
Michael Auli
281
6
0
24 Oct 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
1.4K
55,030
0
22 Oct 2020
A Framework for Generative and Contrastive Learning of Audio Representations
Prateek Verma
J. Smith
SSL
198
21
0
22 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
277
6
0
19 Oct 2020
Knowledge-Grounded Dialogue Generation with Pre-trained Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Xueliang Zhao
Wei Wu
Can Xu
Chongyang Tao
Dongyan Zhao
Rui Yan
410
201
0
17 Oct 2020
Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering
International Conference on Pattern Recognition (ICPR), 2020
Hantao Huang
Tao Han
Wei Han
D. Yap
Cheng-Ming Chiang
133
4
0
17 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
137
21
0
16 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović
Chandra Bhagavatula
J. S. Park
Ronan Le Bras
Noah A. Smith
Yejin Choi
ReLM
LRM
232
63
0
15 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLM
SSL
212
16
0
13 Oct 2020
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar
Xingdi Yuan
Marc-Alexandre Côté
Yonatan Bisk
Adam Trischler
Matthew J. Hausknecht
LM&Ro
LLMAG
415
640
0
08 Oct 2020
Global Self-Attention Networks for Image Recognition
Zhuoran Shen
Irwan Bello
Raviteja Vemulapalli
Xuhui Jia
Ching-Hui Chen
ViT
174
32
0
06 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
342
260
0
06 Oct 2020