VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLM, SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
International Conference on Learning Representations (ICLR), 2023
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
342
10
0
01 Sep 2023
IndGIC: Supervised Action Recognition under Low Illumination
Jing-Teng Zeng
186
3
0
29 Aug 2023
A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER
International Conference on Information and Knowledge Management (CIKM), 2023
Guanting Dong
Zechen Wang
Jinxu Zhao
Gang Zhao
Daichi Guo
...
Keqing He
Xuefeng Li
Liwen Wang
Xinyue Cui
Weiran Xu
216
23
0
28 Aug 2023
Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Jiawen Xie
Pengyu Cheng
Xiao Liang
Yong Dai
Nan Du
290
15
0
25 Aug 2023
Multi-event Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
193
18
0
22 Aug 2023
MusicJam: Visualizing Music Insights via Generated Narrative Illustrations
Communications in Information and Systems (CIS), 2023
Chuer Chen
Nan Cao
Jiani Hou
Yi Guo
Yulei Zhang
Yang Shi
DiffM
200
1
0
22 Aug 2023
Simple Baselines for Interactive Video Retrieval with Questions and Answers
IEEE International Conference on Computer Vision (ICCV), 2023
Kaiqu Liang
Samuel Albanie
200
8
0
21 Aug 2023
Long-range Multimodal Pretraining for Movie Understanding
IEEE International Conference on Computer Vision (ICCV), 2023
Dawit Mureja Argaw
Joon-Young Lee
Markus Woodson
In So Kweon
Fabian Caba Heilbron
VLM
189
14
0
18 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
IEEE International Conference on Computer Vision (ICCV), 2023
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
209
27
0
18 Aug 2023
Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey
International Journal of Computer Vision (IJCV), 2023
Xin Li
Yulin Ren
Xin Jin
Cuiling Lan
Xingyu Wang
Wenjun Zeng
Xinchao Wang
Zhibo Chen
369
139
0
18 Aug 2023
BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction
Knowledge Discovery and Data Mining (KDD), 2023
Dong Wang
Kave Salamatian
Yunqing Xia
Weiwei Deng
Qi Zhang
151
22
0
17 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
IEEE International Conference on Computer Vision (ICCV), 2023
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
293
27
0
16 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
IEEE Transactions on Multimedia (IEEE TMM), 2023
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
187
26
0
15 Aug 2023
Cross-Domain Product Representation Learning for Rich-Content E-Commerce
IEEE International Conference on Computer Vision (ICCV), 2023
Xuehan Bai
Yan Li
Yong Cheng
Wenjie Yang
Quanming Chen
Han Li
169
7
0
10 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Lei Li
Gaoang Wang
VLM, MLLM
620
453
0
31 Jul 2023
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
International Conference on Learning Representations (ICLR), 2023
Qi Zhao
Shijie Wang
Ce Zhang
Changcheng Fu
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
LM&Ro
388
81
0
31 Jul 2023
FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning
Neural Networks (Neural Netw.), 2023
Huy Q. Le
Minh N. H. Nguyen
Chu Myaet Thwal
Yu Qiao
Chao Zhang
Choong Seon Hong
162
26
0
25 Jul 2023
Does Visual Pretraining Help End-to-End Reasoning?
Neural Information Processing Systems (NeurIPS), 2023
Chen Sun
Calvin Luo
Xingyi Zhou
Anurag Arnab
Cordelia Schmid
OCL, LRM, ViT
322
4
0
17 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
International Conference on Learning Representations (ICLR), 2023
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM, VGen
364
405
0
13 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
IEEE International Conference on Computer Vision (ICCV), 2023
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
343
133
0
11 Jul 2023
One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data
Pacific Symposium on Biocomputing (PSB), 2023
Michal Golovanevsky
Eva Schiller
Akira Nair
Ritambhara Singh
Carsten Eickhoff
330
7
0
11 Jul 2023
An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code
International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023
Max Hort
Anastasiia Grishina
Leon Moonen
245
8
0
05 Jul 2023
S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
Ye Xue
Diego Klabjan
J. Utke
94
0
0
01 Jul 2023
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
International Conference on Learning Representations (ICLR), 2023
Fuxiao Liu
Kevin Qinghong Lin
Linjie Li
Jianfeng Wang
Yaser Yacoob
Lijuan Wang
VLM, MLLM
427
404
0
26 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
European Conference on Computer Vision (ECCV), 2023
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
103
6
0
25 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
168
6
0
21 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Jiaming Song
MLLM
177
35
0
15 Jun 2023
Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations
ACM Conference on Recommender Systems (RecSys), 2023
Anima Singh
Trung Vu
Nikhil Mehta
Raghunandan H. Keshavan
M. Sathiamoorthy
...
Lukasz Heldt
Li Wei
Devansh Tandon
Ed H. Chi
Xinyang Yi
237
56
0
13 Jun 2023
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation
Jeremy Gwinnup
Kevin Duh
VLM
148
7
0
12 Jun 2023
CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 2023
Wenhang Ge
Xubing Yang
Li Zhang
184
23
0
12 Jun 2023
Optimizing ViViT Training: Time and Memory Reduction for Action Recognition
Shreyank N. Gowda
Anurag Arnab
Jonathan Huang
ViT
182
4
0
07 Jun 2023
Object Detection with Transformers: A Review
Italian National Conference on Sensors (INS), 2023
Tahira Shehzadi
K. Hashmi
D. Stricker
Muhammad Zeshan Afzal
ViT, MU
418
53
0
07 Jun 2023
Learning to Ground Instructional Articles in Videos through Narrations
IEEE International Conference on Computer Vision (ICCV), 2023
E. Mavroudi
Triantafyllos Afouras
Lorenzo Torresani
DiffM
217
27
0
06 Jun 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
237
7
0
26 May 2023
Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Shao-Yu Wu
Damai Dai
Ziwei Qin
Tianyu Liu
Binghuai Lin
Yunbo Cao
Zhifang Sui
306
17
0
24 May 2023
Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2023
Pin-Er Chen
Po-Ya Angela Wang
Hsin-Yu Chou
Yu-Hsiang Tseng
S. Hsieh
91
1
0
24 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM, CLIP
293
23
0
22 May 2023
How does Contrastive Learning Organize Images?
Yunzhe Zhang
Yao Lu
Qi Xuan
SSL
163
2
0
17 May 2023
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot
Aanisha Bhattacharya
Yaman Kumar Singla
Balaji Krishnamurthy
R. Shah
Changyou Chen
VGen
314
14
0
16 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
395
199
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
378
788
0
10 May 2023
SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Houqiang Li
ViT
252
118
0
08 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Anuj Kumar
VGen
150
4
0
04 May 2023
In-Context Learning Unlocked for Diffusion Models
Neural Information Processing Systems (NeurIPS), 2023
Zhendong Wang
Lezhi Li
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zinan Lin
Mingyuan Zhou
VLM, DiffM
333
96
0
01 May 2023
Early Detection of Alzheimer's Disease using Bottleneck Transformers
International Journal of Intelligent Information Technologies (IJIIT), 2022
Arunima Jaiswal
Ananya Sadana
MedIm
140
5
0
01 May 2023
Multimodal Graph Transformer for Multimodal Question Answering
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Xuehai He
Xin Eric Wang
317
10
0
30 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Computer Vision and Pattern Recognition (CVPR), 2023
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
139
18
0
18 Apr 2023
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
International Conference on Learning Representations (ICLR), 2023
Jiani Huang
Ziyang Li
Mayur Naik
Ser-Nam Lim
667
9
0
15 Apr 2023
How you feelin'? Learning Emotions and Mental States in Movie Scenes
Computer Vision and Pattern Recognition (CVPR), 2023
D. Srivastava
A. Singh
Makarand Tapaswi
226
11
0
12 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP, VLM
199
1
0
10 Apr 2023