1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
SSAN: Separable Self-Attention Network for Video Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Xudong Guo
Xun Guo
Yan Lu
ViT
AI4TS
161
29
0
27 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
IEEE Journal of Biomedical and Health Informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
226
211
0
24 May 2021
Pretrained Language Models for Text Generation: A Survey
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Ji-Rong Wen
LM&MA
VLM
SyDa
261
206
0
21 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Findings of the Association for Computational Linguistics (Findings of ACL), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
Computer Vision and Pattern Recognition (CVPR), 2021
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
390
721
0
18 May 2021
Episodic Transformer for Vision-and-Language Navigation
IEEE International Conference on Computer Vision (ICCV), 2021
Alexander Pashevich
Cordelia Schmid
Chen Sun
LM&Ro
345
212
0
13 May 2021
Designing Multimodal Datasets for NLP Challenges
James Pustejovsky
E. Holderness
Jingxuan Tu
Parker Glenn
Kyeongmin Rim
Kelley Lynch
R. Brutti
201
5
0
12 May 2021
Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning
Yansong Tang
Zhenyu Jiang
Zhenda Xie
Yue Cao
Zheng Zhang
Juil Sock
Han Hu
237
7
0
12 May 2021
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Computer Vision and Pattern Recognition (CVPR), 2021
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogerio Feris
James Glass
Aude Oliva
179
68
0
10 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
827
322
0
10 May 2021
ISTR: End-to-End Instance Segmentation with Transformers
Jie Hu
Liujuan Cao
Yao Lu
Shengchuan Zhang
Yan Wang
Ke Li
Feiyue Huang
Ling Shao
Rongrong Ji
ISeg
170
99
0
03 May 2021
MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
Shuai Peng
Ke Yuan
Liangcai Gao
Zhi Tang
AIMat
220
119
0
02 May 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
IEEE International Conference on Computer Vision (ICCV), 2021
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Hilde Kuehne
Samuel Thomas
...
Rogerio Feris
David Harwath
James R. Glass
M. Picheny
Shih-Fu Chang
SSL
429
96
0
26 Apr 2021
MusCaps: Generating Captions for Music Audio
IEEE International Joint Conference on Neural Networks (IJCNN), 2021
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
281
43
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
303
62
0
23 Apr 2021
Skeletor: Skeletal Transformers for Robust Body-Pose Estimation
Tao Jiang
Necati Cihan Camgöz
Richard Bowden
ViT
238
45
0
23 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2021
Xiaohan Wang
Linchao Zhu
Yi Yang
376
210
0
20 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
182
31
0
20 Apr 2021
Temporal Query Networks for Fine-grained Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
254
98
0
19 Apr 2021
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
ACM Multimedia (ACM MM), 2021
Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
163
47
0
19 Apr 2021
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
Computer Vision and Pattern Recognition (CVPR), 2021
Aditya Prakash
Kashyap Chitta
Andreas Geiger
ViT
274
636
0
19 Apr 2021
AMMU: A Survey of Transformer-based Biomedical Pretrained Language Models
Journal of Biomedical Informatics (JBI), 2021
Katikapalli Subramanyam Kalyan
A. Rajasekharan
S. Sangeetha
LM&MA
MedIm
389
191
0
16 Apr 2021
Self-supervised object detection from audio-visual correspondence
Computer Vision and Pattern Recognition (CVPR), 2021
Triantafyllos Afouras
Yuki M. Asano
Francois Fagan
Andrea Vedaldi
Florian Metze
SSL
322
53
0
13 Apr 2021
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Santiago Castro
Ruoyao Wang
Pingxuan Huang
Ian Stewart
Oana Ignat
Nan Liu
Jonathan C. Stroud
Amélie Reymond
AIMat
281
12
0
09 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
424
303
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
IEEE International Conference on Computer Vision (ICCV), 2021
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
280
116
0
05 Apr 2021
Self-supervised Video Representation Learning by Context and Motion Decoupling
Computer Vision and Pattern Recognition (CVPR), 2021
Lianghua Huang
Yu Liu
Bin Wang
Pan Pan
Yinghui Xu
Rong Jin
SSL
223
55
0
02 Apr 2021
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning
Luowei Zhou
Jingjing Liu
Yu Cheng
Zhe Gan
Lei Zhang
196
7
0
01 Apr 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
233
55
0
30 Mar 2021
Broaden Your Views for Self-Supervised Video Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Adrià Recasens
Pauline Luc
Jean-Baptiste Alayrac
Luyu Wang
Ross Hemsley
...
Florent Altché
M. Valko
Jean-Bastien Grill
Aaron van den Oord
Andrew Zisserman
SSL
AI4TS
293
138
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
344
134
0
30 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
336
165
0
28 Mar 2021
A Comprehensive Review of the Video-to-Text Problem
Artificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
267
18
0
27 Mar 2021
Understanding Robustness of Transformers for Image Classification
IEEE International Conference on Computer Vision (ICCV), 2021
Srinadh Bhojanapalli
Ayan Chakrabarti
Daniel Glasner
Daliang Li
Thomas Unterthiner
Andreas Veit
ViT
309
468
0
26 Mar 2021
VLGrammar: Grounded Grammar Induction of Vision and Language
IEEE International Conference on Computer Vision (ICCV), 2021
Yining Hong
Qing Li
Song-Chun Zhu
Siyuan Huang
VLM
174
26
0
24 Mar 2021
DeepViT: Towards Deeper Vision Transformer
Daquan Zhou
Bingyi Kang
Xiaojie Jin
Linjie Yang
Xiaochen Lian
Zihang Jiang
Qibin Hou
Jiashi Feng
ViT
338
601
0
22 Mar 2021
Incorporating Convolution Designs into Visual Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Kun Yuan
Shaopeng Guo
Ziwei Liu
Aojun Zhou
F. Yu
Wei Wu
ViT
297
566
0
22 Mar 2021
Let Your Heart Speak in its Mother Tongue: Multilingual Captioning of Cardiac Signals
Dani Kiyasseh
T. Zhu
David Clifton
238
0
0
19 Mar 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev
M. Kalashnikov
Stepan Alekseevich Komkov
Aleksandr Petiushko
221
148
0
19 Mar 2021
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
International Conference on Machine Learning (ICML), 2021
Stéphane d'Ascoli
Hugo Touvron
Matthew L. Leavitt
Ari S. Morcos
Giulio Biroli
Levent Sagun
ViT
432
953
0
19 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
271
36
0
18 Mar 2021
Unified Pre-training for Program Understanding and Generation
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wasi Uddin Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
417
851
0
10 Mar 2021
Involution: Inverting the Inherence of Convolution for Visual Recognition
Computer Vision and Pattern Recognition (CVPR), 2021
Duo Li
Jie Hu
Changhu Wang
Xiangtai Li
Qi She
Lei Zhu
Tong Zhang
Qifeng Chen
BDL
224
358
0
10 Mar 2021
Variable-rate discrete representation learning
Sander Dieleman
C. Nash
Jesse Engel
Karen Simonyan
BDL
DRL
209
32
0
10 Mar 2021
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
Computer Vision and Pattern Recognition (CVPR), 2021
Tian Pan
Yibing Song
Tianyu Yang
Wenhao Jiang
Wei Liu
244
252
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
289
49
0
06 Mar 2021
A Straightforward Framework For Video Retrieval Using CLIP
Mexican Conference on Pattern Recognition (MPR), 2021
Jesús Andrés Portillo-Quintero
J. C. Ortíz-Bayliss
Hugo Terashima-Marín
CLIP
719
134
0
24 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
International Conference on Learning Representations (ICLR), 2021
Irwan Bello
506
187
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Computer Vision and Pattern Recognition (CVPR), 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
457
748
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
International Conference on Machine Learning (ICML), 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
1.1K
2,648
0
09 Feb 2021