ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1904.01766
  4. Cited By
VideoBERT: A Joint Model for Video and Language Representation Learning
v1v2 (latest)

VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLMSSL
ArXiv (abs)PDFHTML

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
SSAN: Separable Self-Attention Network for Video Representation Learning
SSAN: Separable Self-Attention Network for Video Representation LearningComputer Vision and Pattern Recognition (CVPR), 2021
Xudong Guo
Xun Guo
Yan Lu
ViTAI4TS
161
29
0
27 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via
  Vision-Language Pre-Training
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-TrainingIEEE journal of biomedical and health informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
226
211
0
24 May 2021
Pretrained Language Models for Text Generation: A Survey
Pretrained Language Models for Text Generation: A Survey
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Ji-Rong Wen
LM&MAVLMSyDa
261
206
0
21 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video
  Understanding
VLM: Task-agnostic Video-Language Model Pre-training for Video UnderstandingFindings (Findings), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
NExT-QA:Next Phase of Question-Answering to Explaining Temporal ActionsComputer Vision and Pattern Recognition (CVPR), 2021
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
390
721
0
18 May 2021
Episodic Transformer for Vision-and-Language Navigation
Episodic Transformer for Vision-and-Language NavigationIEEE International Conference on Computer Vision (ICCV), 2021
Alexander Pashevich
Cordelia Schmid
Chen Sun
LM&Ro
345
212
0
13 May 2021
Designing Multimodal Datasets for NLP Challenges
Designing Multimodal Datasets for NLP Challenges
James Pustejovsky
E. Holderness
Jingxuan Tu
Parker Glenn
Kyeongmin Rim
Kelley Lynch
R. Brutti
201
5
0
12 May 2021
Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for
  Video Correspondence Learning
Breaking Shortcut: Exploring Fully Convolutional Cycle-Consistency for Video Correspondence Learning
Yansong Tang
Zhenyu Jiang
Zhenda Xie
Yue Cao
Zheng Zhang
Juil Sock
Han Hu
237
7
0
12 May 2021
Spoken Moments: Learning Joint Audio-Visual Representations from Video
  Descriptions
Spoken Moments: Learning Joint Audio-Visual Representations from Video DescriptionsComputer Vision and Pattern Recognition (CVPR), 2021
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogerio Feris
James Glass
Aude Oliva
179
68
0
10 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic
  Survey
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic SurveyArtificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
827
322
0
10 May 2021
ISTR: End-to-End Instance Segmentation with Transformers
ISTR: End-to-End Instance Segmentation with Transformers
Jie Hu
Liujuan Cao
Yao Lu
Shengchuan Zhang
Yan Wang
Ke Li
Feiyue Huang
Ling Shao
Rongrong Ji
ISeg
170
99
0
03 May 2021
MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
Shuai Peng
Ke Yuan
Liangcai Gao
Zhi Tang
AIMat
220
119
0
02 May 2021
Multimodal Clustering Networks for Self-supervised Learning from
  Unlabeled Videos
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled VideosIEEE International Conference on Computer Vision (ICCV), 2021
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Hilde Kuehne
Samuel Thomas
...
Rogerio Feris
David Harwath
James R. Glass
M. Picheny
Shih-Fu Chang
SSL
429
96
0
26 Apr 2021
MusCaps: Generating Captions for Music Audio
MusCaps: Generating Captions for Music AudioIEEE International Joint Conference on Neural Network (IJCNN), 2021
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
281
43
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
Playing Lottery Tickets with Vision and LanguageAAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
303
62
0
23 Apr 2021
Skeletor: Skeletal Transformers for Robust Body-Pose Estimation
Skeletor: Skeletal Transformers for Robust Body-Pose Estimation
Tao Jiang
Necati Cihan Camgöz
Richard Bowden
ViT
238
45
0
23 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
T2VLAD: Global-Local Sequence Alignment for Text-Video RetrievalComputer Vision and Pattern Recognition (CVPR), 2021
Xiaohan Wang
Linchao Zhu
Yi Yang
376
210
0
20 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
Detector-Free Weakly Supervised Grounding by SeparationIEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
182
31
0
20 Apr 2021
Temporal Query Networks for Fine-grained Video Understanding
Temporal Query Networks for Fine-grained Video UnderstandingComputer Vision and Pattern Recognition (CVPR), 2021
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
254
98
0
19 Apr 2021
Understanding Chinese Video and Language via Contrastive Multimodal
  Pre-Training
Understanding Chinese Video and Language via Contrastive Multimodal Pre-TrainingACM Multimedia (ACM MM), 2021
Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
163
47
0
19 Apr 2021
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
Multi-Modal Fusion Transformer for End-to-End Autonomous DrivingComputer Vision and Pattern Recognition (CVPR), 2021
Aditya Prakash
Kashyap Chitta
Andreas Geiger
ViT
274
636
0
19 Apr 2021
AMMU : A Survey of Transformer-based Biomedical Pretrained Language
  Models
AMMU : A Survey of Transformer-based Biomedical Pretrained Language ModelsJournal of Biomedical Informatics (JBI), 2021
Katikapalli Subramanyam Kalyan
A. Rajasekharan
S. Sangeetha
LM&MAMedIm
389
191
0
16 Apr 2021
Self-supervised object detection from audio-visual correspondence
Self-supervised object detection from audio-visual correspondenceComputer Vision and Pattern Recognition (CVPR), 2021
Triantafyllos Afouras
Yuki M. Asano
Francois Fagan
Andrea Vedaldi
Florian Metze
SSL
322
53
0
13 Apr 2021
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding
  Evaluation Framework
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation FrameworkAnnual Meeting of the Association for Computational Linguistics (ACL), 2021
Santiago Castro
Ruoyao Wang
Pingxuan Huang
Ian Stewart
Oana Ignat
Nan Liu
Jonathan C. Stroud
Amélie Reymond
AIMat
281
12
0
09 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language
  Representation Learning
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation LearningComputer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLMViT
424
303
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
Compressing Visual-linguistic Model via Knowledge DistillationIEEE International Conference on Computer Vision (ICCV), 2021
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
280
116
0
05 Apr 2021
Self-supervised Video Representation Learning by Context and Motion
  Decoupling
Self-supervised Video Representation Learning by Context and Motion DecouplingComputer Vision and Pattern Recognition (CVPR), 2021
Lianghua Huang
Yu Liu
Bin Wang
Pan Pan
Yinghui Xu
Rong Jin
SSL
223
55
0
02 Apr 2021
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language
  Representation Learning
CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning
Luowei Zhou
Jingjing Liu
Yu Cheng
Zhe Gan
Lei Zhang
196
7
0
01 Apr 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
Diagnosing Vision-and-Language Navigation: What Really MattersNorth American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
233
55
0
30 Mar 2021
Broaden Your Views for Self-Supervised Video Learning
Broaden Your Views for Self-Supervised Video LearningIEEE International Conference on Computer Vision (ICCV), 2021
Adrià Recasens
Pauline Luc
Jean-Baptiste Alayrac
Luyu Wang
Ross Hemsley
...
Florent Altché
M. Valko
Jean-Bastien Grill
Aaron van den Oord
Andrew Zisserman
SSLAI4TS
293
138
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Kaleido-BERT: Vision-Language Pre-training on Fashion DomainComputer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
344
134
0
30 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text
  Retrieval
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text RetrievalIEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
336
165
0
28 Mar 2021
A Comprehensive Review of the Video-to-Text Problem
A Comprehensive Review of the Video-to-Text ProblemArtificial Intelligence Review (AIR), 2021
Jesus Perez-Martin
B. Bustos
S. Guimarães
I. Sipiran
Jorge A. Pérez
Grethel Coello Said
267
18
0
27 Mar 2021
Understanding Robustness of Transformers for Image Classification
Understanding Robustness of Transformers for Image ClassificationIEEE International Conference on Computer Vision (ICCV), 2021
Srinadh Bhojanapalli
Ayan Chakrabarti
Daniel Glasner
Daliang Li
Thomas Unterthiner
Andreas Veit
ViT
309
468
0
26 Mar 2021
VLGrammar: Grounded Grammar Induction of Vision and Language
VLGrammar: Grounded Grammar Induction of Vision and LanguageIEEE International Conference on Computer Vision (ICCV), 2021
Yining Hong
Qing Li
Song-Chun Zhu
Siyuan Huang
VLM
174
26
0
24 Mar 2021
DeepViT: Towards Deeper Vision Transformer
DeepViT: Towards Deeper Vision Transformer
Daquan Zhou
Bingyi Kang
Xiaojie Jin
Linjie Yang
Xiaochen Lian
Zihang Jiang
Qibin Hou
Jiashi Feng
ViT
338
601
0
22 Mar 2021
Incorporating Convolution Designs into Visual Transformers
Incorporating Convolution Designs into Visual TransformersIEEE International Conference on Computer Vision (ICCV), 2021
Kun Yuan
Shaopeng Guo
Ziwei Liu
Aojun Zhou
F. Yu
Wei Wu
ViT
297
566
0
22 Mar 2021
Let Your Heart Speak in its Mother Tongue: Multilingual Captioning of
  Cardiac Signals
Let Your Heart Speak in its Mother Tongue: Multilingual Captioning of Cardiac Signals
Dani Kiyasseh
T. Zhu
David Clifton
238
0
0
19 Mar 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev
M. Kalashnikov
Stepan Alekseevich Komkov
Aleksandr Petiushko
221
148
0
19 Mar 2021
ConViT: Improving Vision Transformers with Soft Convolutional Inductive
  Biases
ConViT: Improving Vision Transformers with Soft Convolutional Inductive BiasesInternational Conference on Machine Learning (ICML), 2021
Stéphane dÁscoli
Hugo Touvron
Matthew L. Leavitt
Ari S. Morcos
Giulio Biroli
Levent Sagun
ViT
432
953
0
19 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation
  Learning
Space-Time Crop & Attend: Improving Cross-modal Video Representation LearningIEEE International Conference on Computer Vision (ICCV), 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
271
36
0
18 Mar 2021
Unified Pre-training for Program Understanding and Generation
Unified Pre-training for Program Understanding and GenerationNorth American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wasi Uddin Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
417
851
0
10 Mar 2021
Involution: Inverting the Inherence of Convolution for Visual
  Recognition
Involution: Inverting the Inherence of Convolution for Visual RecognitionComputer Vision and Pattern Recognition (CVPR), 2021
Duo Li
Jie Hu
Changhu Wang
Xiangtai Li
Qi She
Lei Zhu
Tong Zhang
Qifeng Chen
BDL
224
358
0
10 Mar 2021
Variable-rate discrete representation learning
Variable-rate discrete representation learning
Sander Dieleman
C. Nash
Jesse Engel
Karen Simonyan
BDLDRL
209
32
0
10 Mar 2021
VideoMoCo: Contrastive Video Representation Learning with Temporally
  Adversarial Examples
VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial ExamplesComputer Vision and Pattern Recognition (CVPR), 2021
Tian Pan
Yibing Song
Tianyu Yang
Wenhao Jiang
Wei Liu
244
252
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal
  Tasks with Language and Vision
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and VisionInternational Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
289
49
0
06 Mar 2021
A Straightforward Framework For Video Retrieval Using CLIP
A Straightforward Framework For Video Retrieval Using CLIPMexican Conference on Pattern Recognition (MPR), 2021
Jesús Andrés Portillo-Quintero
J. C. Ortíz-Bayliss
Hugo Terashima-Marín
CLIP
719
134
0
24 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
LambdaNetworks: Modeling Long-Range Interactions Without AttentionInternational Conference on Learning Representations (ICLR), 2021
Irwan Bello
506
187
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse
  Sampling
Less is More: ClipBERT for Video-and-Language Learning via Sparse SamplingComputer Vision and Pattern Recognition (CVPR), 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
457
748
0
11 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Is Space-Time Attention All You Need for Video Understanding?International Conference on Machine Learning (ICML), 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
1.1K
2,648
0
09 Feb 2021
Previous
123...121314151617
Next