arXiv:1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 802 papers shown
Pathological Visual Question Answering
Xuehai He
Zhuo Cai
Wenlan Wei
Yichen Zhang
Luntian Mou
Eric Xing
P. Xie
282
30
0
06 Oct 2020
Hard Negative Mixing for Contrastive Learning
Neural Information Processing Systems (NeurIPS), 2020
Yannis Kalantidis
Mert Bulent Sariyildiz
Noé Pion
Philippe Weinzaepfel
Diane Larlus
SSL
503
713
0
02 Oct 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
194
106
0
23 Sep 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
202
43
0
17 Sep 2020
Multi-modal Summarization for Video-containing Documents
Xiyan Fu
Jun Wang
Zhenglu Yang
134
26
0
17 Sep 2020
Knowledge Guided Learning: Towards Open Domain Egocentric Action Recognition with Zero Supervision
Sathyanarayanan N. Aakur
Sanjoy Kundu
Nikhil Gunti
EgoV
135
1
0
16 Sep 2020
Active Contrastive Learning of Audio-Visual Video Representations
Shuang Ma
Zhaoyang Zeng
Daniel J. McDuff
Yale Song
VLM
SSL
164
9
0
31 Aug 2020
DeVLBert: Learning Deconfounded Visio-Linguistic Representations
Shengyu Zhang
Tan Jiang
Tan Wang
Kun Kuang
Zhou Zhao
Jianke Zhu
Jin Yu
Hongxia Yang
Leilei Gan
OOD
203
94
0
16 Aug 2020
Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan
Ke Bai
Liqun Chen
Yizhe Zhang
Chenyang Tao
Chunyuan Li
Guoyin Wang
Ricardo Henao
Lawrence Carin
OT
154
7
0
14 Aug 2020
Spatiotemporal Contrastive Video Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2020
Rui Qian
Tianjian Meng
Boqing Gong
Ming-Hsuan Yang
Jian Shu
Serge J. Belongie
Huayu Chen
SSL
AI4TS
381
543
0
09 Aug 2020
ConvBERT: Improving BERT with Span-based Dynamic Convolution
Neural Information Processing Systems (NeurIPS), 2020
Zihang Jiang
Weihao Yu
Daquan Zhou
Yunpeng Chen
Jiashi Feng
Shuicheng Yan
334
198
0
06 Aug 2020
Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz
J. Perez
Diane Larlus
VLM
SSL
254
171
0
04 Aug 2020
Neural Language Generation: Formulation, Methods, and Evaluation
Cristina Garbacea
Qiaozhu Mei
349
29
0
31 Jul 2020
Spatially Aware Multimodal Transformers for TextVQA
European Conference on Computer Vision (ECCV), 2020
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
191
93
0
23 Jul 2020
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
Anurag Arnab
Chen Sun
Arsha Nagrani
Cordelia Schmid
146
30
0
21 Jul 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
1.1K
674
0
21 Jul 2020
Towards Debiasing Sentence Representations
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Paul Pu Liang
Irene Li
Emily Zheng
Y. Lim
Ruslan Salakhutdinov
Louis-Philippe Morency
209
269
0
16 Jul 2020
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Wanrong Zhu
Xinze Wang
Tsu-Jui Fu
An Yan
P. Narayana
Kazoo Sone
Sugato Basu
Wenjie Wang
335
38
0
01 Jul 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
380
399
0
30 Jun 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
373
397
0
29 Jun 2020
Video Representation Learning with Visual Tempo Consistency
Ceyuan Yang
Yinghao Xu
Bo Dai
Bolei Zhou
146
94
0
28 Jun 2020
Video-Grounded Dialogues with Pretrained Generation Language Models
Hung Le
Guosheng Lin
170
31
0
27 Jun 2020
Unsupervised Video Decomposition using Spatio-temporal Iterative Inference
Polina Zablotskaia
E. Dominici
Leonid Sigal
Andreas M. Lehrmann
OCL
264
20
0
25 Jun 2020
Labelling unlabelled videos from scratch with multi-modal self-supervision
Neural Information Processing Systems (NeurIPS), 2020
Yuki M. Asano
Mandela Patrick
Christian Rupprecht
Andrea Vedaldi
SSL
257
161
0
24 Jun 2020
Learning Potentials of Quantum Systems using Deep Neural Networks
Arijit Sehanobish
H. Corzo
Onur Kara
David van Dijk
131
12
0
23 Jun 2020
Automating Text Naturalness Evaluation of NLG Systems
Erion Çano
Ondrej Bojar
86
0
0
23 Jun 2020
Weak Supervision and Referring Attention for Temporal-Textual Association Learning
Zhiyuan Fang
Shu Kong
Zhe Wang
Charless C. Fowlkes
Yezhou Yang
120
20
0
21 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD
SSL
280
157
0
17 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko
Angie Boggust
David Harwath
Brian Chen
D. Joshi
...
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
226
142
0
16 Jun 2020
Video Understanding as Machine Translation
Bruno Korbar
Fabio Petroni
Rohit Girdhar
Lorenzo Torresani
SSL
195
29
0
12 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Neural Information Processing Systems (NeurIPS), 2020
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
350
536
0
11 Jun 2020
In the Eye of the Beholder: Gaze and Actions in First Person Video
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Yin Li
Miao Liu
James M. Rehg
EgoV
266
92
0
31 May 2020
Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction
L. Rasmy
Yang Xiang
Z. Xie
Cui Tao
Degui Zhi
AI4MH
LM&MA
266
839
0
22 May 2020
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
D. Gao
Linbo Jin
Ben Chen
Minghui Qiu
Peng Li
Yi Wei
Yitao Hu
Haozhe Jasper Wang
OOD
205
146
0
20 May 2020
Human-like general language processing
Feng Qi
Guanjun Jiang
AI4CE
84
2
0
19 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
260
138
0
15 May 2020
Cross-Modality Relevance for Reasoning on Language and Vision
Chen Zheng
Quan Guo
Parisa Kordjamshidi
LRM
130
37
0
12 May 2020
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Jie Lei
Liwei Wang
Yelong Shen
Dong Yu
Tamara L. Berg
Joey Tianyi Zhou
197
200
0
11 May 2020
VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
Jiyang Gao
Chen Sun
Hang Zhao
Yi Shen
Dragomir Anguelov
Congcong Li
Cordelia Schmid
392
962
0
08 May 2020
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Max Bain
Arsha Nagrani
A. Brown
Andrew Zisserman
371
110
0
08 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Pengfei Yu
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
159
116
0
05 May 2020
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Frank F. Xu
Lei Ji
Ding Wang
Junyi Du
Graham Neubig
Yonatan Bisk
Nan Duan
119
21
0
02 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
661
536
0
01 May 2020
Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Jack Hessel
Zhenhai Zhu
Bo Pang
Radu Soricut
198
4
0
29 Apr 2020
Span-based Localizing Network for Natural Language Video Localization
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
343
362
0
29 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
335
107
0
28 Apr 2020
ColBERT: Using BERT Sentence Embedding in Parallel Neural Networks for Computational Humor
Expert systems with applications (ESWA), 2020
Issa Annamoradnejad
Gohar Zoghi
228
35
0
27 Apr 2020
MCQA: Multimodal Co-attention Based Network for Question Answering
Abhishek Kumar
Trisha Mittal
Tianyi Zhou
100
15
0
25 Apr 2020
Experience Grounds Language
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Jacob Andreas
Yoshua Bengio
...
Angeliki Lazaridou
Jonathan May
Aleksandr Nisnevich
Nicolas Pinto
Joseph P. Turian
479
397
0
21 Apr 2020
DIET: Lightweight Language Understanding for Dialogue Systems
Tanja Bunk
Daksh Varshneya
Vladimir Vlasov
Alan Nichol
320
173
0
21 Apr 2020