VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM, SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
DIET: Lightweight Language Understanding for Dialogue Systems
Tanja Bunk
Daksh Varshneya
Vladimir Vlasov
Alan Nichol
340
174
0
21 Apr 2020
lamBERT: Language and Action Learning Using Multimodal BERT
Kazuki Miyazawa
Tatsuya Aoki
Takato Horii
Takayuki Nagai
SSL, LM&Ro
168
12
0
15 Apr 2020
Coreferential Reasoning Learning for Language Representation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Deming Ye
Yankai Lin
Jiaju Du
Zhenghao Liu
Peng Li
Maosong Sun
Zhiyuan Liu
235
184
0
15 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
727
2,133
0
13 Apr 2020
Context-Aware Group Captioning via Self-Attention and Contrastive Features
Computer Vision and Pattern Recognition (CVPR), 2020
Zhuowan Li
Quan Hung Tran
Long Mai
Zhe Lin
Alan Yuille
VLM
168
50
0
07 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
382
469
0
02 Apr 2020
Caption Generation of Robot Behaviors based on Unsupervised Learning of Action Segments
International Workshop on Spoken Dialogue Systems Technology (SDST), 2020
Koichiro Yoshino
Kohei Wakimoto
Yuta Nishimura
Satoshi Nakamura
116
8
0
23 Mar 2020
Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020
Yansong Tang
Jiwen Lu
Jie Zhou
175
41
0
20 Mar 2020
Pre-trained Models for Natural Language Processing: A Survey
Science China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA, VLM
1.1K
1,616
0
18 Mar 2020
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Zhiyuan Fang
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
291
65
0
11 Mar 2020
On Compositions of Transformations in Contrastive Self-Supervised Learning
IEEE International Conference on Computer Vision (ICCV), 2020
Mandela Patrick
Yuki M. Asano
Polina Kuznetsova
Ruth C. Fong
João F. Henriques
Geoffrey Zweig
Andrea Vedaldi
230
53
0
09 Mar 2020
Cross-modal Learning for Multi-modal Video Categorization
Palash Goyal
Saurabh Sahu
Shalini Ghosh
Chul Lee
256
10
0
07 Mar 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
AAAI Conference on Artificial Intelligence (AAAI), 2020
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
324
130
0
06 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Natural Language Processing and Chinese Computing (NLPCC), 2020
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM, VLM
238
84
0
03 Mar 2020
Visual Commonsense R-CNN
Computer Vision and Pattern Recognition (CVPR), 2020
Tan Wang
Jianqiang Huang
Hanwang Zhang
Qianru Sun
SSL, ObjD, CML
268
278
0
27 Feb 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
246
7
0
25 Feb 2020
Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
Computer Vision and Pattern Recognition (CVPR), 2020
Weituo Hao
Chunyuan Li
Xiujun Li
Lawrence Carin
Jianfeng Gao
LM&Ro
305
325
0
25 Feb 2020
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Findings (Findings), 2020
Zhangyin Feng
Daya Guo
Duyu Tang
Nan Duan
Xiaocheng Feng
...
Linjun Shou
Bing Qin
Ting Liu
Daxin Jiang
Ming Zhou
1.2K
3,355
0
19 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Ding Wang
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
365
418
0
15 Feb 2020
Vocoder-free End-to-End Voice Conversion with Transformer Network
IEEE International Joint Conference on Neural Network (IJCNN), 2020
June-Woo Kim
H. Jung
Minho Lee
91
4
0
05 Feb 2020
Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog
Zekang Li
Zongjia Li
Jinchao Zhang
Yang Feng
Cheng Niu
Jie Zhou
235
38
0
01 Feb 2020
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Tianhao Li
Limin Wang
VGen
141
60
0
16 Jan 2020
Meshed-Memory Transformer for Image Captioning
Computer Vision and Pattern Recognition (CVPR), 2019
Marcella Cornia
Matteo Stefanini
Lorenzo Baraldi
Rita Cucchiara
254
1,025
0
17 Dec 2019
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Computer Vision and Pattern Recognition (CVPR), 2019
Antoine Miech
Jean-Baptiste Alayrac
Lucas Smaira
Ivan Laptev
Josef Sivic
Andrew Zisserman
VGen, SSL
599
754
0
13 Dec 2019
Listen to Look: Action Recognition by Previewing Audio
Computer Vision and Pattern Recognition (CVPR), 2019
Ruohan Gao
Tae-Hyun Oh
Kristen Grauman
Lorenzo Torresani
VLM
307
282
0
10 Dec 2019
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
Computer Vision and Pattern Recognition (CVPR), 2019
Sara Beery
Guanhang Wu
V. Rathod
Ronny Votel
Jonathan Huang
ObjD
269
126
0
07 Dec 2019
Personalized Patent Claim Generation and Measurement
Jieh-Sheng Lee
182
4
0
07 Dec 2019
Self-Supervised Learning of Video-Induced Visual Invariances
Computer Vision and Pattern Recognition (CVPR), 2019
Michael Tschannen
Josip Djolonga
Marvin Ritter
Aravindh Mahendran
Xiaohua Zhai
N. Houlsby
Sylvain Gelly
Mario Lucic
SSL
345
65
0
05 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
European Conference on Computer Vision (ECCV), 2019
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
343
120
0
05 Dec 2019
BERT for Large-scale Video Segment Classification with Test-time Augmentation
Tianqi Liu
Qizhan Shao
123
4
0
02 Dec 2019
Learning to Learn Words from Visual Scenes
Dídac Surís
Dave Epstein
Heng Ji
Shih-Fu Chang
Carl Vondrick
VLM, CLIP, SSL, OffRL
186
4
0
25 Nov 2019
Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences
ACM Multimedia (ACM MM), 2019
Shizhe Chen
Bei Liu
Jianlong Fu
Ruihua Song
Qin Jin
Pingping Lin
Xiaoyu Qi
Chunting Wang
Jin Zhou
DiffM
171
34
0
24 Nov 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
IEEE Journal on Selected Topics in Signal Processing (JSTSP), 2019
Chao Zhang
Zichao Yang
Xiaodong He
Li Deng
HAI, AI4TS
319
401
0
10 Nov 2019
Probing Contextualized Sentence Representations with Visual Awareness
Zhuosheng Zhang
Rui Wang
Kehai Chen
Masao Utiyama
Eiichiro Sumita
Hai Zhao
228
2
0
07 Nov 2019
A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions
Conference on Computational Natural Language Learning (CoNLL), 2019
Jack Hessel
Bo Pang
Zhenhai Zhu
Radu Soricut
169
39
0
07 Oct 2019
LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2019
Reuben Tan
Huijuan Xu
Kate Saenko
Bryan A. Plummer
300
75
0
27 Sep 2019
UNITER: UNiversal Image-TExt Representation Learning
European Conference on Computer Vision (ECCV), 2019
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
VLM, OT
345
464
0
25 Sep 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
AAAI Conference on Artificial Intelligence (AAAI), 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM, VLM
692
1,008
0
24 Sep 2019
Zero-Shot Action Recognition in Videos: A Survey
Valter Estevam
Hélio Pedrini
David Menotti
285
61
0
13 Sep 2019
Supervised Multimodal Bitransformers for Classifying Images and Text
Douwe Kiela
Suvrat Bhooshan
Hamed Firooz
Ethan Perez
Davide Testuggine
323
295
0
06 Sep 2019
A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling
Frontiers in Robotics and AI (Front. Robot. AI), 2019
Haoran Chen
Ke Lin
A. Maye
Jianmin Li
Xiaoling Hu
155
49
0
31 Aug 2019
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
International Conference on Learning Representations (ICLR), 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM, MLLM, SSL
628
1,795
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Hao Tan
Mohit Bansal
VLM, MLLM
745
2,755
0
20 Aug 2019
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM
730
945
0
16 Aug 2019
Fusion of Detected Objects in Text for Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter
251
181
0
14 Aug 2019
VisualBERT: A Simple and Performant Baseline for Vision and Language
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
578
2,202
0
09 Aug 2019
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Neural Information Processing Systems (NeurIPS), 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM
908
4,199
0
06 Aug 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
British Machine Vision Conference (BMVC), 2019
Yang Liu
Samuel Albanie
Arsha Nagrani
Andrew Zisserman
283
424
0
31 Jul 2019
Finding Moments in Video Collections Using Natural Language
Victor Escorcia
Mattia Soldan
Josef Sivic
Guohao Li
Bryan C. Russell
182
11
0
30 Jul 2019
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
Journal of Artificial Intelligence Research (JAIR), 2019
Aditya Mogadala
M. Kalimuthu
Dietrich Klakow
VLM
404
142
0
22 Jul 2019