VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019

Carl Vondrick

Kevin Patrick Murphy

Cordelia Schmid

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Kilichbek Haydarov

Mohamed Elhoseiny

264

44

0

09 Apr 2023

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Vladislav Lialin

267

8

0

04 Apr 2023

Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation

Neural Information Processing Systems (NeurIPS), 2023

259

11

0

04 Apr 2023

Unbiased Scene Graph Generation in Videos

Computer Vision and Pattern Recognition (CVPR), 2023

Subarna Tripathi

Amit K. Roy-Chowdhury

428

40

0

03 Apr 2023

Procedure-Aware Pretraining for Instructional Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2023

Honglu Zhou

Roberto Martín-Martín

Mubbasir Kapadia

Silvio Savarese

Juan Carlos Niebles

290

55

0

31 Mar 2023

Self-Supervised Multimodal Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Oisin Mac Aodha

Timothy M. Hospedales

319

89

0

31 Mar 2023

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

Computer Vision and Pattern Recognition (CVPR), 2023

236

46

0

31 Mar 2023

Dual Cross-Attention for Medical Image Segmentation

Engineering Applications of Artificial Intelligence (Eng. Appl. Artif. Intell.), 2023

Gorkem Can Ates

164

137

0

30 Mar 2023

Object Discovery from Motion-Guided Tokens

Computer Vision and Pattern Recognition (CVPR), 2023

Adrien Gaidon

204

28

0

27 Mar 2023

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

Chenglong Li

219

15

0

26 Mar 2023

Selective Structured State-Spaces for Long-Form Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2023

208

159

0

25 Mar 2023

Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Mohammad Rostami

193

12

0

25 Mar 2023

Learning and Verification of Task Structure in Instructional Videos

Medhini Narasimhan

254

24

0

23 Mar 2023

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Computer Vision and Pattern Recognition (CVPR), 2023

Joon-Young Choi

Hyeong Kyu Choi

226

29

0

23 Mar 2023

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

Computer Vision and Pattern Recognition (CVPR), 2023

Jianmin Bao

195

40

0

22 Mar 2023

Text with Knowledge Graph Augmented Transformer for Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2023

Yufei Wang

211

73

0

22 Mar 2023

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Computer Vision and Pattern Recognition (CVPR), 2023

274

18

0

22 Mar 2023

VideoXum: Cross-modal Visual and Textural Summarization of Videos

IEEE Transactions on Multimedia (IEEE TMM), 2023

381

50

0

21 Mar 2023

Transformers in Speech Processing: A Survey

Heriberto Cuayáhuitl

Moazzam Shoukat

448

68

0

21 Mar 2023

Retrieving Multimodal Information for Augmented Generation: A Survey

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Hailin Chen

...

411

127

0

20 Mar 2023

Dual-path Adaptation from Image to Video Transformers

Computer Vision and Pattern Recognition (CVPR), 2023

250

57

0

17 Mar 2023

Aerial Diffusion: Text Guided Ground-to-Aerial View Translation from a Single Image using Diffusion Models

D. Kothandaraman

229

6

0

15 Mar 2023

Accommodating Audio Modality in CLIP for Multimodal Processing

AAAI Conference on Artificial Intelligence (AAAI), 2023

Qin Jin

179

17

0

12 Mar 2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Ran Cheng

Ping Luo

250

14

0

11 Mar 2023

TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions

147

0

0

09 Mar 2023

Comparing Trajectory and Vision Modalities for Verb Representation

92

1

0

08 Mar 2023

Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents

Neural Information Processing Systems (NeurIPS), 2023

Wenlong Huang

...

256

78

0

01 Mar 2023

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Computer Vision and Pattern Recognition (CVPR), 2023

Yang Liu

327

42

0

28 Feb 2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2023

Paul Hongsuck Seo

Jordi Pont-Tuset

Cordelia Schmid

497

325

0

27 Feb 2023

Contrastive Video Question Answering via Video Graph Transformer

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Angela Yao

248

51

0

27 Feb 2023

Deep Learning for Video-Text Retrieval: a Review

International Journal of Multimedia Information Retrieval (IJMIR), 2023

226

28

0

24 Feb 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Machine Intelligence Research (MIR), 2023

Yaowei Wang

Yonghong Tian

467

272

0

20 Feb 2023

STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2023

384

9

0

20 Feb 2023

Hyneter: Hybrid Network Transformer for Object Detection

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

193

6

0

18 Feb 2023

Transformadores: Fundamentos teoricos y Aplicaciones

293

0

0

18 Feb 2023

Multimodal Subtask Graph Generation from Instructional Videos

Lajanugen Logeswaran

195

14

0

17 Feb 2023

Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection

Hao Chen

108

1

0

16 Feb 2023

Multi-modal Machine Learning in Engineering Design: A Review and Future Directions

Journal of Computing and Information Science in Engineering (JCISE), 2023

356

64

0

14 Feb 2023

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

Raghvendra Kumar

114

22

0

13 Feb 2023

BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization

AAAI Conference on Artificial Intelligence (AAAI), 2023

273

60

0

10 Feb 2023

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Automatic Speech Recognition & Understanding (ASRU), 2023

382

43

0

10 Feb 2023

SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images

Medical Physics (Lancaster) (Med. Phys.), 2023

213

21

0

08 Feb 2023

Program Generation from Diverse Video Demonstrations

British Machine Vision Conference (BMVC), 2023

Anthony Manchin

Qi Wu

Anton Van Den Hengel

83

0

0

01 Feb 2023

Semi-Parametric Video-Grounded Text Generation

244

17

0

27 Jan 2023

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Computer Vision and Pattern Recognition (CVPR), 2023

257

74

0

26 Jan 2023

Flow-guided Semi-supervised Video Object Segmentation

Andreas Robinson

Michael Felsberg

188

1

0

25 Jan 2023

MultiNet with Transformers: A Model for Cancer Diagnosis Using Images

181

8

0

21 Jan 2023

Temporal Perceiving Video-Language Pre-training

Heng Wang

Yi Yang

206

17

0

18 Jan 2023

A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

565

354

0

13 Jan 2023

Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction

International Conference on Machine Learning (ICML), 2023

Dang Nguyen

166

9

0

12 Jan 2023
