VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Kevin Patrick Murphy

Cordelia Schmid

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

International Conference on Learning Representations (ICLR), 2023

Ashmit Khandelwal

Aanisha Bhattacharyya

Yaman Kumar Singla

...

Ishita Dasgupta

Stefano Petrangeli

Balaji Krishnamurthy

342

10

0

01 Sep 2023

IndGIC: Supervised Action Recognition under Low Illumination

186

3

0

29 Aug 2023

A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER

International Conference on Information and Knowledge Management (CIKM), 2023

...

Weiran Xu

216

23

0

28 Aug 2023

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

290

15

0

25 Aug 2023

Multi-event Video-Text Retrieval

IEEE International Conference on Computer Vision (ICCV), 2023

Jindong Gu

193

18

0

22 Aug 2023

MusicJam: Visualizing Music Insights via Generated Narrative Illustrations

Communications in Information and Systems (CIS), 2023

Nan Cao

200

1

0

22 Aug 2023

Simple Baselines for Interactive Video Retrieval with Questions and Answers

IEEE International Conference on Computer Vision (ICCV), 2023

200

8

0

21 Aug 2023

Long-range Multimodal Pretraining for Movie Understanding

IEEE International Conference on Computer Vision (ICCV), 2023

Dawit Mureja Argaw

In So Kweon

Fabian Caba Heilbron

189

14

0

18 Aug 2023

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

IEEE International Conference on Computer Vision (ICCV), 2023

Jeong Hun Yeo

209

27

0

18 Aug 2023

Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey

International Journal of Computer Vision (IJCV), 2023

369

139

0

18 Aug 2023

BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction

Knowledge Discovery and Data Mining (KDD), 2023

Kave Salamatian

151

22

0

17 Aug 2023

Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

IEEE International Conference on Computer Vision (ICCV), 2023

Philip H.S. Torr

293

27

0

16 Aug 2023

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

IEEE Transactions on Multimedia (IEEE TMM), 2023

Jeong Hun Yeo

187

26

0

15 Aug 2023

Cross-Domain Product Representation Learning for Rich-Content E-Commerce

IEEE International Conference on Computer Vision (ICCV), 2023

169

7

0

10 Aug 2023

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2023

...

620

453

0

31 Jul 2023

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

International Conference on Learning Representations (ICLR), 2023

Shijie Wang

388

81

0

31 Jul 2023

FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning

Neural Networks (Neural Netw.), 2023

Minh N. H. Nguyen

Chu Myaet Thwal

Yu Qiao

Choong Seon Hong

162

26

0

25 Jul 2023

Does Visual Pretraining Help End-to-End Reasoning?

Neural Information Processing Systems (NeurIPS), 2023

Cordelia Schmid

322

4

0

17 Jul 2023

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

International Conference on Learning Representations (ICLR), 2023

Yi Wang

...

Ping Luo

Ziwei Liu

Yu Qiao

364

405

0

13 Jul 2023

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

IEEE International Conference on Computer Vision (ICCV), 2023

Shraman Pramanick

Kevin Qinghong Lin

Mike Zheng Shou

Ramalingam Chellappa

Pengchuan Zhang

343

133

0

11 Jul 2023

One-Versus-Others Attention: Scalable Multimodal Integration for Clinical Data

Pacific Symposium on Biocomputing (PSB), 2023

Michal Golovanevsky

Ritambhara Singh

Carsten Eickhoff

330

7

0

11 Jul 2023

An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code

International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023

Max Hort

Anastasiia Grishina

Leon Moonen

245

8

0

05 Jul 2023

S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture

94

0

0

01 Jul 2023

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

International Conference on Learning Representations (ICLR), 2023

Fuxiao Liu

Kevin Qinghong Lin

427

404

0

26 Jun 2023

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

European Conference on Computer Vision (ECCV), 2023

103

6

0

25 Jun 2023

Exploring the Role of Audio in Video Captioning

Linjie Yang

Ehsan Elhamifar

Heng Wang

168

6

0

21 Jun 2023

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

Yi Wang

Yu Qiao

177

35

0

15 Jun 2023

Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations

ACM Conference on Recommender Systems (RecSys), 2023

Raghunandan H. Keshavan

M. Sathiamoorthy

...

237

56

0

13 Jun 2023

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

148

7

0

12 Jun 2023

CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), 2023

184

23

0

12 Jun 2023

Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Shreyank N. Gowda

182

4

0

07 Jun 2023

Object Detection with Transformers: A Review

Italian National Conference on Sensors (INS), 2023

Tahira Shehzadi

Muhammad Zeshan Afzal

418

53

0

07 Jun 2023

Learning to Ground Instructional Articles in Videos through Narrations

IEEE International Conference on Computer Vision (ICCV), 2023

Triantafyllos Afouras

Lorenzo Torresani

217

27

0

06 Jun 2023

LANISTR: Multimodal Learning from Structured and Unstructured Data

Tomas Pfister

237

7

0

26 May 2023

Denoising Bottleneck with Mutual Information Maximization for Video Multimodal Fusion

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Zhifang Sui

306

17

0

24 May 2023

Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis

Pacific Asia Conference on Language, Information and Computation (PACLIC), 2023

Po-Ya Angela Wang

Yu-Hsiang Tseng

91

1

0

24 May 2023

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

IEEE Transactions on Multimedia (IEEE TMM), 2023

Yi Yang

293

23

0

22 May 2023

How does Contrastive Learning Organize Images?

Yao Lu

163

2

0

17 May 2023

A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot

Aanisha Bhattacharyya

Yaman Kumar Singla

Balaji Krishnamurthy

314

14

0

16 May 2023

Self-Chained Image-Language Model for Video Localization and Question Answering

Neural Information Processing Systems (NeurIPS), 2023

Joey Tianyi Zhou

395

199

0

11 May 2023

VideoChat: Chat-Centric Video Understanding

Yi Wang

Ping Luo

Yu Qiao

378

788

0

10 May 2023

SignBERT+: Hand-model-aware Self-supervised Pre-training for Sign Language Understanding

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

252

118

0

08 May 2023

VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation

Yashar Mehdad

150

4

0

04 May 2023

In-Context Learning Unlocked for Diffusion Models

Neural Information Processing Systems (NeurIPS), 2023

Mingyuan Zhou

333

96

0

01 May 2023

Early Detection of Alzheimer's Disease using Bottleneck Transformers

International Journal of Intelligent Information Technologies (IJIIT), 2022

Arunima Jaiswal

140

5

0

01 May 2023

Multimodal Graph Transformer for Multimodal Question Answering

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

317

10

0

30 Apr 2023

SViTT: Temporal Learning of Sparse Video-Text Transformers

Computer Vision and Pattern Recognition (CVPR), 2023

Subarna Tripathi

Nuno Vasconcelos

139

18

0

18 Apr 2023

LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision

International Conference on Learning Representations (ICLR), 2023

Ser-Nam Lim

667

9

0

15 Apr 2023

How you feelin'? Learning Emotions and Mental States in Movie Scenes

Computer Vision and Pattern Recognition (CVPR), 2023

Makarand Tapaswi

226

11

0

12 Apr 2023

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Shentong Mo

199

1

0

10 Apr 2023
