VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM, SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
264
44
0
09 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM, AI4TS
267
8
0
04 Apr 2023
Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation
Neural Information Processing Systems (NeurIPS), 2023
M. Jung
He Zhao
Joanna Dipnall
Lan Du
UQCV, BDL
259
11
0
04 Apr 2023
Unbiased Scene Graph Generation in Videos
Computer Vision and Pattern Recognition (CVPR), 2023
Sayak Nag
Kyle Min
Subarna Tripathi
Amit K. Roy-Chowdhury
428
40
0
03 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
290
55
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
319
89
0
31 Mar 2023
Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
Computer Vision and Pattern Recognition (CVPR), 2023
Yiwu Zhong
Licheng Yu
Yang Bai
Shangwen Li
Xueting Yan
Yin Li
AI4TS
236
46
0
31 Mar 2023
Dual Cross-Attention for Medical Image Segmentation
Engineering Applications of Artificial Intelligence (Eng. Appl. Artif. Intell.), 2023
Gorkem Can Ates
P. Mohan
Emrah Çelik
164
137
0
30 Mar 2023
Object Discovery from Motion-Guided Tokens
Computer Vision and Pattern Recognition (CVPR), 2023
Zhipeng Bao
P. Tokmakov
Yu-Xiong Wang
Adrien Gaidon
M. Hebert
OCL
204
28
0
27 Mar 2023
RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning
Yabin Zhu
Chenglong Li
Tianlin Li
Jin Tang
Zhixiang Huang
219
15
0
26 Mar 2023
Selective Structured State-Spaces for Long-Form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Jue Wang
Wenjie Zhu
Pichao Wang
Xiang Yu
Linda Liu
Mohamed Omar
Raffay Hamid
208
159
0
25 Mar 2023
Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yuliang Cai
Jesse Thomason
Mohammad Rostami
VLM, CLL
193
12
0
25 Mar 2023
Learning and Verification of Task Structure in Instructional Videos
Medhini Narasimhan
Licheng Yu
Sean Bell
Ning Zhang
Trevor Darrell
254
24
0
23 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Computer Vision and Pattern Recognition (CVPR), 2023
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
226
29
0
23 Mar 2023
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2023
Yiting Cheng
Fangyun Wei
Jianmin Bao
Dong Chen
Wenqian Zhang
SLR
195
40
0
22 Mar 2023
Text with Knowledge Graph Augmented Transformer for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2023
Xin Gu
G. Chen
Yufei Wang
Libo Zhang
Tiejian Luo
Longyin Wen
211
73
0
22 Mar 2023
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Computer Vision and Pattern Recognition (CVPR), 2023
Sixun Dong
Huazhang Hu
Dongze Lian
Weixin Luo
Yichen Qian
Shenghua Gao
ViT, AI4TS
274
18
0
22 Mar 2023
VideoXum: Cross-modal Visual and Textural Summarization of Videos
IEEE Transactions on Multimedia (IEEE TMM), 2023
Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jenhao Hsiao
C. Ho
Jiebo Luo
381
50
0
21 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
448
68
0
21 Mar 2023
Retrieving Multimodal Information for Augmented Generation: A Survey
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ruochen Zhao
Hailin Chen
Weishi Wang
Fangkai Jiao
Do Xuan Long
...
Bosheng Ding
Xiaobao Guo
Minzhi Li
Xingxuan Li
Shafiq Joty
411
127
0
20 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Computer Vision and Pattern Recognition (CVPR), 2023
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
ViT
250
57
0
17 Mar 2023
Aerial Diffusion: Text Guided Ground-to-Aerial View Translation from a Single Image using Diffusion Models
D. Kothandaraman
Wanrong Zhu
Ming Lin
Dinesh Manocha
229
6
0
15 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
AAAI Conference on Artificial Intelligence (AAAI), 2023
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
179
17
0
12 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
250
14
0
11 Mar 2023
TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions
He Zhu
Xihua Li
Xuemin Zhao
Yunbo Cao
Shan Yu
147
0
0
09 Mar 2023
Comparing Trajectory and Vision Modalities for Verb Representation
Dylan Ebert
Chen Sun
Ellie Pavlick
92
1
0
08 Mar 2023
Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents
Neural Information Processing Systems (NeurIPS), 2023
Wenlong Huang
Fei Xia
Dhruv Shah
Danny Driess
Andy Zeng
...
Pete Florence
Igor Mordatch
Sergey Levine
Karol Hausman
Brian Ichter
LM&Ro
256
78
0
01 Mar 2023
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Computer Vision and Pattern Recognition (CVPR), 2023
Dezhao Luo
Jiabo Huang
S. Gong
Hailin Jin
Yang Liu
VGen
327
42
0
28 Feb 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2023
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS, VLM
497
325
0
27 Feb 2023
Contrastive Video Question Answering via Video Graph Transformer
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
248
51
0
27 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
International Journal of Multimedia Information Retrieval (IJMIR), 2023
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
226
28
0
24 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Machine Intelligence Research (MIR), 2023
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE, VLM
467
272
0
20 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
384
9
0
20 Feb 2023
Hyneter: Hybrid Network Transformer for Object Detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Dong Chen
Duoqian Miao
Xuepeng Zhao
ViT
193
6
0
18 Feb 2023
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
293
0
0
18 Feb 2023
Multimodal Subtask Graph Generation from Instructional Videos
Y. Jang
Sungryull Sohn
Lajanugen Logeswaran
Tiange Luo
Moontae Lee
Ho Hin Lee
195
14
0
17 Feb 2023
Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
Hao Chen
Feihong Shen
ViT
108
1
0
16 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
356
64
0
14 Feb 2023
Large Scale Multi-Lingual Multi-Modal Summarization Dataset
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Yash Verma
Anubhav Jangra
Raghvendra Kumar
S. Saha
114
22
0
13 Feb 2023
BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization
AAAI Conference on Artificial Intelligence (AAAI), 2023
Weichao Zhao
Hezhen Hu
Wen-gang Zhou
Jiaxin Shi
Houqiang Li
SLR
273
60
0
10 Feb 2023
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Automatic Speech Recognition & Understanding (ASRU), 2023
Jiachen Lian
Alexei Baevski
Wei-Ning Hsu
Michael Auli
SSL
382
43
0
10 Feb 2023
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images
Medical Physics (Lancaster) (Med. Phys.), 2023
Gary Y. Li
Junyu Chen
Se-In Jang
Kuang Gong
Shijie Zhao
ViT, MedIm
213
21
0
08 Feb 2023
Program Generation from Diverse Video Demonstrations
British Machine Vision Conference (BMVC), 2023
Anthony Manchin
Jamie Sherrah
Qi Wu
Anton Van Den Hengel
VGen
83
0
0
01 Feb 2023
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim
Jin-Hwa Kim
Jiyoung Lee
Minjoon Seo
VGen
244
17
0
27 Jan 2023
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
Computer Vision and Pattern Recognition (CVPR), 2023
Ruyang Liu
Jingjia Huang
Ge Li
Jiashi Feng
Xing Wu
Thomas H. Li
AI4TS, CLIP, VLM
257
74
0
26 Jan 2023
Flow-guided Semi-supervised Video Object Segmentation
Yushan Zhang
Andreas Robinson
M. Magnusson
Michael Felsberg
VOS
188
1
0
25 Jan 2023
MultiNet with Transformers: A Model for Cancer Diagnosis Using Images
H. Barzekar
Yash J. Patel
L. Tong
Zeyun Yu
MedIm
181
8
0
21 Jan 2023
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
206
17
0
18 Jan 2023
A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jie Gui
Tuo Chen
Jing Zhang
Qiong Cao
Zhe Sun
Haoran Luo
Dacheng Tao
565
354
0
13 Jan 2023
Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction
International Conference on Machine Learning (ICML), 2023
Khai Nguyen
Dang Nguyen
N. Ho
166
9
0
12 Jan 2023