ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1904.01766
  4. Cited By
VideoBERT: A Joint Model for Video and Language Representation Learning
v1v2 (latest)

VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLMSSL
ArXiv (abs)PDFHTML

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
Dynamic Neural Networks: A Survey
Dynamic Neural Networks: A SurveyIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Yizeng Han
Gao Huang
Shiji Song
Le Yang
Honghui Wang
Yulin Wang
3DHAI4TSAI4CE
425
802
0
09 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Unifying Vision-and-Language Tasks via Text GenerationInternational Conference on Machine Learning (ICML), 2021
Jaemin Cho
Jie Lei
Hao Tan
Joey Tianyi Zhou
MLLM
598
609
0
04 Feb 2021
Environment Predictive Coding for Embodied Agents
Environment Predictive Coding for Embodied Agents
Santhosh Kumar Ramakrishnan
Tushar Nagarajan
Ziad Al-Halah
Kristen Grauman
195
14
0
03 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal
  Transformers
Decoupling the Role of Data, Attention, and Losses in Multimodal TransformersTransactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
234
126
0
31 Jan 2021
VX2TEXT: End-to-End Learning of Video-Based Text Generation From
  Multimodal Inputs
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal InputsComputer Vision and Pattern Recognition (CVPR), 2021
Xudong Lin
Gedas Bertasius
Jue Wang
Shih-Fu Chang
Devi Parikh
Lorenzo Torresani
VGen
248
74
0
28 Jan 2021
Bottleneck Transformers for Visual Recognition
Bottleneck Transformers for Visual RecognitionComputer Vision and Pattern Recognition (CVPR), 2021
A. Srinivas
Nayeon Lee
Niki Parmar
Jonathon Shlens
Pieter Abbeel
Ashish Vaswani
SLR
681
1,124
0
27 Jan 2021
AI Choreographer: Music Conditioned 3D Dance Generation with AIST++
AI Choreographer: Music Conditioned 3D Dance Generation with AIST++IEEE International Conference on Computer Vision (ICCV), 2021
Ruilong Li
Sha Yang
David A. Ross
Angjoo Kanazawa
ViT
739
637
0
21 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of
  Supplemental Knowledge
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
213
31
0
15 Jan 2021
Learning Temporal Dynamics from Cycles in Narrated Video
Learning Temporal Dynamics from Cycles in Narrated VideoIEEE International Conference on Computer Vision (ICCV), 2021
Dave Epstein
Jiajun Wu
Cordelia Schmid
Chen Sun
AI4TS
252
15
0
07 Jan 2021
Transformers in Vision: A Survey
Transformers in Vision: A SurveyACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
923
3,152
0
04 Jan 2021
Accurate Word Representations with Universal Visual Guidance
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
182
0
0
30 Dec 2020
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
  Understanding
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document UnderstandingAnnual Meeting of the Association for Computational Linguistics (ACL), 2020
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViTMLLM
840
610
0
29 Dec 2020
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attentionInternational Conference on Machine Learning (ICML), 2020
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Edouard Grave
ViT
647
8,277
0
23 Dec 2020
A Survey on Visual Transformer
A Survey on Visual TransformerIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Kai Han
Yunhe Wang
Hanting Chen
Xinghao Chen
Jianyuan Guo
...
Chunjing Xu
Yixing Xu
Zhaohui Yang
Yiman Zhang
Dacheng Tao
ViT
1.0K
3,095
0
23 Dec 2020
Human Action Recognition from Various Data Modalities: A Review
Human Action Recognition from Various Data Modalities: A ReviewIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Zehua Sun
Qiuhong Ke
Hossein Rahmani
Mohammed Bennamoun
Gang Wang
Jun Liu
MU
582
699
0
22 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained
  Models
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
263
50
0
15 Dec 2020
Attention over learned object embeddings enables complex visual
  reasoning
Attention over learned object embeddings enables complex visual reasoningNeural Information Processing Systems (NeurIPS), 2020
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
366
78
0
15 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual
  Commonsense Reasoning
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense ReasoningKnowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSLLRM
256
42
0
13 Dec 2020
A Comprehensive Study of Deep Video Action Recognition
A Comprehensive Study of Deep Video Action Recognition
Yi Zhu
Xinyu Li
Chunhui Liu
Mohammadreza Zolfaghari
Yuanjun Xiong
Chongruo Wu
Zhi-Li Zhang
Joseph Tighe
R. Manmatha
Mu Li
VLMAI4TS
283
210
0
11 Dec 2020
Look Before you Speak: Visually Contextualized Utterances
Look Before you Speak: Visually Contextualized UtterancesComputer Vision and Pattern Recognition (CVPR), 2020
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
311
71
0
10 Dec 2020
Hateful Memes Detection via Complementary Visual and Linguistic Networks
Hateful Memes Detection via Complementary Visual and Linguistic Networks
W. Zhang
Guihua Liu
Zhuohua Li
Fuqing Zhu
104
21
0
09 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation
  Learning
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
272
89
0
08 Dec 2020
Deep Learning and the Global Workspace Theory
Deep Learning and the Global Workspace TheoryTrends in Neurosciences (TINS), 2020
R. V. Rullen
Ryota Kanai
202
78
0
04 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of
  Hateful Memes Challenge
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
149
16
0
02 Dec 2020
Pose-based Sign Language Recognition using GCN and BERT
Pose-based Sign Language Recognition using GCN and BERT
Anirudh Tunga
Sai Vidyaranya Nuthalapati
J. Wachs
SLR
200
93
0
01 Dec 2020
Task Programming: Learning Data Efficient Behavior Representations
Task Programming: Learning Data Efficient Behavior RepresentationsComputer Vision and Pattern Recognition (CVPR), 2020
Jennifer J. Sun
Ann Kennedy
Eric Zhan
David J. Anderson
Yisong Yue
Pietro Perona
266
63
0
27 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
A Recurrent Vision-and-Language BERT for NavigationComputer Vision and Pattern Recognition (CVPR), 2020
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
326
382
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
307
73
0
25 Nov 2020
Neuro-Symbolic Representations for Video Captioning: A Case for
  Leveraging Inductive Biases for Vision and Language
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
Hassan Akbari
Hamid Palangi
Jianwei Yang
Sudha Rao
Asli Celikyilmaz
Roland Fernandez
P. Smolensky
Jianfeng Gao
Shih-Fu Chang
205
3
0
18 Nov 2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video
  Corpus
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Bowen Zhang
Hexiang Hu
Joonseok Lee
Mingde Zhao
Sheide Chammas
Vihan Jain
Eugene Ie
Fei Sha
203
39
0
18 Nov 2020
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient
  Updates and Internal Feature Distributions
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature DistributionsIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Jianan Wang
Boyang Albert Li
Xiangyu Fan
Jing-Hua Lin
Yanwei Fu
146
3
0
15 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
ActBERT: Learning Global-Local Video-Text RepresentationsComputer Vision and Pattern Recognition (CVPR), 2020
Linchao Zhu
Yi Yang
ViT
324
451
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
181
101
0
10 Nov 2020
Tabular Transformers for Modeling Multivariate Time Series
Tabular Transformers for Modeling Multivariate Time Series
Inkit Padhi
Yair Schiff
Igor Melnyk
Mattia Rigotti
Youssef Mroueh
Pierre Dognin
Jerret Ross
Ravi Nair
Erik Altman
LMTDAI4TS
287
114
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation
  Learning
COOT: Cooperative Hierarchical Transformer for Video-Text Representation LearningNeural Information Processing Systems (NeurIPS), 2020
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViTCLIP
204
178
0
01 Nov 2020
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised
  Video Representation Leaning
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning
L. Tao
Xueting Wang
T. Yamasaki
VLMSSL
250
14
0
29 Oct 2020
A Visuospatial Dataset for Naturalistic Verb Learning
A Visuospatial Dataset for Naturalistic Verb Learning
Dylan Ebert
Ellie Pavlick
113
7
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
Co-attentional Transformers for Story-Based Video UnderstandingIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Björn Bebensee
Byoung-Tak Zhang
136
7
0
27 Oct 2020
Multilingual Speech Translation with Efficient Finetuning of Pretrained
  Models
Multilingual Speech Translation with Efficient Finetuning of Pretrained Models
Xian Li
Changhan Wang
Yun Tang
C. Tran
Yuqing Tang
J. Pino
Alexei Baevski
Alexis Conneau
Michael Auli
281
6
0
24 Oct 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
1.4K
55,030
0
22 Oct 2020
A Framework for Generative and Contrastive Learning of Audio
  Representations
A Framework for Generative and Contrastive Learning of Audio Representations
Prateek Verma
J. Smith
SSL
198
21
0
22 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and
  Emerging Trends
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
277
6
0
19 Oct 2020
Knowledge-Grounded Dialogue Generation with Pre-trained Language Models
Knowledge-Grounded Dialogue Generation with Pre-trained Language ModelsConference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Xueliang Zhao
Wei Wu
Can Xu
Chongyang Tao
Dongyan Zhao
Rui Yan
410
201
0
17 Oct 2020
Answer-checking in Context: A Multi-modal FullyAttention Network for
  Visual Question Answering
Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question AnsweringInternational Conference on Pattern Recognition (ICPR), 2020
Hantao Huang
Tao Han
Wei Han
D. Yap
Cheng-Ming Chiang
133
4
0
17 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal
  Contrastive Learning
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive LearningConference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
137
21
0
16 Oct 2020
Natural Language Rationales with Full-Stack Visual Reasoning: From
  Pixels to Semantic Frames to Commonsense Graphs
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović
Chandra Bhagavatula
J. S. Park
Ronan Le Bras
Noah A. Smith
Yejin Choi
ReLMLRM
232
63
0
15 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence
  Representations
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLMSSL
212
16
0
13 Oct 2020
ALFWorld: Aligning Text and Embodied Environments for Interactive
  Learning
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar
Xingdi Yuan
Marc-Alexandre Côté
Yonatan Bisk
Adam Trischler
Matthew J. Hausknecht
LM&RoLLMAG
415
640
0
08 Oct 2020
Global Self-Attention Networks for Image Recognition
Global Self-Attention Networks for Image Recognition
Zhuoran Shen
Irwan Bello
Raviteja Vemulapalli
Xuhui Jia
Ching-Hui Chen
ViT
174
32
0
06 Oct 2020
Support-set bottlenecks for video-text representation learning
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
342
260
0
06 Oct 2020
Previous
123...1314151617
Next