VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLM SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Neural Information Processing Systems (NeurIPS), 2022
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Jiaming Song
404
266
0
27 Jun 2022
Semantic Role Aware Correlation Transformer for Text to Video Retrieval
IEEE International Conference on Image Processing (ICIP), 2021
Burak Satar
Erik Cambria
Xavier Bresson
J. Lim
ViT
146
12
0
26 Jun 2022
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Burak Satar
Erik Cambria
Hanwang Zhang
J. Lim
173
13
0
26 Jun 2022
Do Trajectories Encode Verb Meaning?
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Dylan Ebert
Chen Sun
Ellie Pavlick
206
2
0
23 Jun 2022
Self-Supervised Learning for Videos: A Survey
ACM Computing Surveys (ACM CSUR), 2022
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
480
168
0
18 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Neural Information Processing Systems (NeurIPS), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
485
277
0
16 Jun 2022
VCT: A Video Compression Transformer
Neural Information Processing Systems (NeurIPS), 2022
Fabian Mentzer
G. Toderici
David C. Minnen
S. Hwang
Sergi Caelles
Mario Lucic
E. Agustsson
ViT
206
129
0
15 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Computer Vision and Pattern Recognition (CVPR), 2022
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM VLM
207
94
0
14 Jun 2022
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
Neural Information Processing Systems (NeurIPS), 2022
Tuan Dinh
Yuchen Zeng
Ruisu Zhang
Ziqian Lin
Michael Gira
Shashank Rajput
Jy-yong Sohn
Dimitris Papailiopoulos
Kangwook Lee
LMTD
576
172
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
571
846
0
13 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
239
142
0
07 Jun 2022
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Jiayuan He
Daniel V. Smith
Flora D. Salim
AI4TS
257
43
0
06 Jun 2022
Revisiting the "Video" in Video-Language Understanding
Computer Vision and Pattern Recognition (CVPR), 2022
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
216
202
0
03 Jun 2022
Egocentric Video-Language Pretraining
Neural Information Processing Systems (NeurIPS), 2022
Kevin Qinghong Lin
Alex Jinpeng Wang
Mattia Soldan
Michael Wray
Rui Yan
...
Hongfa Wang
Dima Damen
Guohao Li
Wei Liu
Mike Zheng Shou
VLM EgoV
268
249
0
03 Jun 2022
TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Kashyap Chitta
Aditya Prakash
Bernhard Jaeger
Zehao Yu
Katrin Renz
Andreas Geiger
ViT
615
522
0
31 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
613
714
0
27 May 2022
Sample-Efficient Optimisation with Probabilistic Transformer Surrogates
A. Maraval
Matthieu Zimmer
Antoine Grosnit
Rasul Tutunov
Jun Wang
H. Ammar
179
2
0
27 May 2022
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Yuxing Chen
Renshu Gu
Ouhan Huang
Gangyong Jia
3DH
205
13
0
25 May 2022
Multimodal Conversational AI: A Survey of Datasets and Approaches
Anirudh S. Sundar
Larry Heck
166
33
0
13 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
330
39
0
10 May 2022
Chart Question Answering: State of the Art and Future Directions
Enamul Hoque
P. Kavehzadeh
Ahmed Masry
163
53
0
08 May 2022
Cross-modal Representation Learning for Zero-shot Action Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chung-Ching Lin
Kevin Qinghong Lin
Linjie Li
Lijuan Wang
Zicheng Liu
ViT
152
30
0
03 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
281
18
0
02 May 2022
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Shuai Zhao
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM CLIP
206
153
0
02 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Neural Information Processing Systems (NeurIPS), 2022
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM VLM
713
4,954
0
29 Apr 2022
Where in the World is this Image? Transformer-based Geo-localization in the Wild
European Conference on Computer Vision (ECCV), 2022
Shraman Pramanick
E. Nowara
Joshua Gleason
Carlos D. Castillo
Rama Chellappa
ViT
212
60
0
29 Apr 2022
Relevance-based Margin for Contrastively-trained Video Retrieval Models
International Conference on Multimedia Retrieval (ICMR), 2022
Alex Falcon
Swathikiran Sudhakaran
G. Serra
Sergio Escalera
Oswald Lanz
371
10
0
27 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
162
48
0
26 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Guohao Li
M. Donoser
Loris Bazzani
199
25
0
26 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
150
10
0
23 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
219
2
0
22 Apr 2022
Imagination-Augmented Natural Language Understanding
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Yujie Lu
Wanrong Zhu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
227
25
0
18 Apr 2022
Modality-Balanced Embedding for Video Retrieval
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Xun Wang
Bingqing Ke
Xuanping Li
Fangyu Liu
Mingyu Zhang
Xiao Liang
Qi-En Xiao
Cheng Luo
Yue Yu
144
12
0
18 Apr 2022
End-to-end Dense Video Captioning as Sequence Generation
International Conference on Computational Linguistics (COLING), 2022
Wanrong Zhu
Bo Pang
Ashish V. Thapliyal
William Yang Wang
Radu Soricut
DiffM
227
44
0
18 Apr 2022
Probabilistic Compositional Embeddings for Multimodal Image Retrieval
Andrei Neculai
Yanbei Chen
Zeynep Akata
CoGe
275
44
0
12 Apr 2022
Are Multimodal Transformers Robust to Missing Modality?
Computer Vision and Pattern Recognition (CVPR), 2022
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
316
214
0
12 Apr 2022
Hierarchical Self-supervised Representation Learning for Movie Understanding
Computer Vision and Pattern Recognition (CVPR), 2022
Fanyi Xiao
Kaustav Kundu
Joseph Tighe
Davide Modolo
SSL
202
27
0
06 Apr 2022
Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation
Computer Vision and Pattern Recognition (CVPR), 2022
Wangbo Zhao
Kai Wang
Xiangxiang Chu
Fuzhao Xue
Xinchao Wang
Yang You
243
30
0
06 Apr 2022
Long Movie Clip Classification with State-Space Video Models
European Conference on Computer Vision (ECCV), 2022
Md. Mohaiminul Islam
Gedas Bertasius
VLM
436
140
0
04 Apr 2022
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Conference on Robot Learning (CoRL), 2022
Michael Ahn
Anthony Brohan
Noah Brown
Yevgen Chebotar
Omar Cortes
...
Ted Xiao
Peng Xu
Sichun Xu
Mengyuan Yan
Andy Zeng
LM&Ro
595
2,634
0
04 Apr 2022
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Tian Yun
Usha Bhalla
Ellie Pavlick
Chen Sun
ReLM CoGe VLM LRM
279
35
0
31 Mar 2022
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
Computer Vision and Pattern Recognition (CVPR), 2022
Dohwan Ko
Joonmyung Choi
Juyeon Ko
Shinyeong Noh
Kyoung-Woon On
Eun-Sol Kim
Hyunwoo J. Kim
VGen AI4TS
173
27
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Liulei Li
Tianfei Zhou
Wenguan Wang
Pu Cao
Jian-Wei Li
Yi Yang
SSL
292
56
0
27 Mar 2022
Audio-Adaptive Activity Recognition Across Video Domains
Computer Vision and Pattern Recognition (CVPR), 2022
Yun C. Zhang
Hazel Doughty
Ling Shao
Cees G. M. Snoek
193
50
0
27 Mar 2022
MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain
Recent Advances in Natural Language Processing (RANLP), 2022
Jan Pasek
Jakub Sido
Miloslav Konopík
O. Pražák
203
1
0
26 Mar 2022
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness
Computer Vision and Pattern Recognition (CVPR), 2022
Giulio Lovisotto
Nicole Finnie
Mauricio Muñoz
Chaithanya Kumar Mummadi
J. H. Metzen
AAML ViT
141
49
0
25 Mar 2022
Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
A. Bucker
Luis F. C. Figueredo
Sami Haddadin
Ashish Kapoor
Shuang Ma
Rogerio Bonatti
LM&Ro
243
61
0
25 Mar 2022
LocATe: End-to-end Localization of Actions in 3D with Transformers
Jiankai Sun
Bolei Zhou
Michael J. Black
Arjun Chandrasekaran
263
9
0
21 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
325
101
0
18 Mar 2022