arXiv: 1904.01766

VideoBERT: A Joint Model for Video and Language Representation Learning


3 April 2019

Carl Vondrick

Kevin Patrick Murphy

Cordelia Schmid


Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Neural Information Processing Systems (NeurIPS), 2022

404

266

0

27 Jun 2022

Semantic Role Aware Correlation Transformer for Text to Video Retrieval

International Conference on Image Processing (ICIP), 2021

146

12

0

26 Jun 2022

RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

173

13

0

26 Jun 2022

Do Trajectories Encode Verb Meaning?

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

206

2

0

23 Jun 2022

Self-Supervised Learning for Videos: A Survey

ACM Computing Surveys (ACM CSUR), 2022

Madeline Chantry Schiappa

480

168

0

18 Jun 2022

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Neural Information Processing Systems (NeurIPS), 2022

Cordelia Schmid

485

277

0

16 Jun 2022

VCT: A Video Compression Transformer

Neural Information Processing Systems (NeurIPS), 2022

David C. Minnen

206

129

0

15 Jun 2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Computer Vision and Pattern Recognition (CVPR), 2022

Kevin Qinghong Lin

Chung-Ching Lin

Zicheng Liu

207

94

0

14 Jun 2022

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

Neural Information Processing Systems (NeurIPS), 2022

Shashank Rajput

Dimitris Papailiopoulos

576

172

0

14 Jun 2022

Multimodal Learning with Transformers: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

571

846

0

13 Jun 2022

Revealing Single Frame Bias for Video-and-Language Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Joey Tianyi Zhou

239

142

0

07 Jun 2022

Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data

Shohreh Deldari

Daniel V. Smith

257

43

0

06 Jun 2022

Revisiting the "Video" in Video-Language Understanding

Computer Vision and Pattern Recognition (CVPR), 2022

Cristobal Eyzaguirre

Adrien Gaidon

Jiajun Wu

Juan Carlos Niebles

216

202

0

03 Jun 2022

Egocentric Video-Language Pretraining

Neural Information Processing Systems (NeurIPS), 2022

Kevin Qinghong Lin

Alex Jinpeng Wang

Rui Yan

...

Hongfa Wang

Dima Damen

Wei Liu

Mike Zheng Shou

268

249

0

03 Jun 2022

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Bernhard Jaeger

Zehao Yu

615

522

0

31 May 2022

GIT: A Generative Image-to-text Transformer for Vision and Language


Kevin Qinghong Lin

Zicheng Liu

613

714

0

27 May 2022

Sample-Efficient Optimisation with Probabilistic Transformer Surrogates


Matthieu Zimmer

Antoine Grosnit

Jun Wang

179

2

0

27 May 2022

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

Gangyong Jia

205

13

0

25 May 2022

Multimodal Conversational AI: A Survey of Datasets and Approaches


Anirudh S. Sundar

166

33

0

13 May 2022

Learning to Answer Visual Questions from Web Videos

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

Cordelia Schmid

330

39

0

10 May 2022

Chart Question Answering: State of the Art and Future Directions


163

53

0

08 May 2022

Cross-modal Representation Learning for Zero-shot Action Recognition

Computer Vision and Pattern Recognition (CVPR), 2022

Chung-Ching Lin

Kevin Qinghong Lin

Zicheng Liu

152

30

0

03 May 2022

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering


A. Piergiovanni

281

18

0

02 May 2022

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022

206

153

0

02 May 2022

Flamingo: a Visual Language Model for Few-Shot Learning

Neural Information Processing Systems (NeurIPS), 2022

Jean-Baptiste Alayrac

...

Mikolaj Binkowski

Ricardo Barreira

Andrew Zisserman

713

4,954

0

29 Apr 2022

Where in the World is this Image? Transformer-based Geo-localization in the Wild

European Conference on Computer Vision (ECCV), 2022

Shraman Pramanick

Carlos D. Castillo

212

60

0

29 Apr 2022

Relevance-based Margin for Contrastively-trained Video Retrieval Models

International Conference on Multimedia Retrieval (ICMR), 2022

Swathikiran Sudhakaran

Sergio Escalera

371

10

0

27 Apr 2022

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

European Conference on Computer Vision (ECCV), 2022

Alex Jinpeng Wang

Ying Shan

Ping Luo

162

48

0

26 Apr 2022

Contrastive Language-Action Pre-training for Temporal Localization


199

25

0

26 Apr 2022

Training and challenging models for text-guided fashion image retrieval


Gaurav Srivastava

150

10

0

23 Apr 2022

A Multi-level Alignment Training Scheme for Video-and-Language Grounding


Govind Thattai

219

2

0

22 Apr 2022

Imagination-Augmented Natural Language Understanding

North American Chapter of the Association for Computational Linguistics (NAACL), 2022

Miguel P. Eckstein

William Yang Wang

227

25

0

18 Apr 2022

Modality-Balanced Embedding for Video Retrieval

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022

144

12

0

18 Apr 2022

End-to-end Dense Video Captioning as Sequence Generation

International Conference on Computational Linguistics (COLING), 2022

Ashish V. Thapliyal

William Yang Wang

227

44

0

18 Apr 2022

Probabilistic Compositional Embeddings for Multimodal Image Retrieval


275

44

0

12 Apr 2022

Are Multimodal Transformers Robust to Missing Modality?

Computer Vision and Pattern Recognition (CVPR), 2022

Davide Testuggine

316

214

0

12 Apr 2022

Hierarchical Self-supervised Representation Learning for Movie Understanding

Computer Vision and Pattern Recognition (CVPR), 2022

202

27

0

06 Apr 2022

Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation

Computer Vision and Pattern Recognition (CVPR), 2022

Yang You

243

30

0

06 Apr 2022

Long Movie Clip Classification with State-Space Video Models

European Conference on Computer Vision (ECCV), 2022

Md. Mohaiminul Islam

Gedas Bertasius

436

140

0

04 Apr 2022

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Conference on Robot Learning (CoRL), 2022

Yevgen Chebotar

...

595

2,634

0

04 Apr 2022

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?


279

35

0

31 Mar 2022

Video-Text Representation Learning via Differentiable Weak Temporal Alignment

Computer Vision and Pattern Recognition (CVPR), 2022

173

27

0

31 Mar 2022

TubeDETR: Spatio-Temporal Video Grounding with Transformers

Computer Vision and Pattern Recognition (CVPR), 2022

Cordelia Schmid

341

121

0

30 Mar 2022

Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning

Computer Vision and Pattern Recognition (CVPR), 2022

292

56

0

27 Mar 2022

Audio-Adaptive Activity Recognition Across Video Domains

Computer Vision and Pattern Recognition (CVPR), 2022

Cees G. M. Snoek

193

50

0

27 Mar 2022

MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain

Recent Advances in Natural Language Processing (RANLP), 2022

Miloslav Konopík

203

1

0

26 Mar 2022

Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness

Computer Vision and Pattern Recognition (CVPR), 2022

Giulio Lovisotto

Mauricio Muñoz

Chaithanya Kumar Mummadi

141

49

0

25 Mar 2022

Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022

Luis F. C. Figueredo

Rogerio Bonatti

243

61

0

25 Mar 2022

LocATe: End-to-end Localization of Actions in 3D with Transformers


Michael J. Black

Arjun Chandrasekaran

263

9

0

21 Mar 2022

Local-Global Context Aware Transformer for Language-Guided Video Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

325

101

0

18 Mar 2022
