VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLM, SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
243
2
0
18 Jul 2024
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
Donggeun Kim
Taesup Kim
265
12
0
17 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
479
10
0
04 Jul 2024
MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations
Akash Dutta
Ali Jannesari
235
3
0
02 Jul 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
277
66
0
27 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Interspeech (Interspeech), 2024
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
177
3
0
13 Jun 2024
ProTrain: Efficient LLM Training via Memory-Aware Techniques
Hanmei Yang
Jin Zhou
Yao Fu
Xiaoqun Wang
Ramine Roane
Hui Guan
Tongping Liu
VLM
234
3
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM, CLIP
200
8
0
11 Jun 2024
AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
Xing Zhang
Jiaxi Gu
Haoyu Zhao
Shicong Wang
Hang Xu
Renjing Pei
Songcen Xu
Zuxuan Wu
Yu-Gang Jiang
267
0
0
11 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
587
27
1
09 Jun 2024
Seeing the Unseen: Visual Metaphor Captioning for Videos
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Abisek Rajakumar Kalarani
Pushpak Bhattacharyya
Sumit Shekhar
VLM
164
1
0
07 Jun 2024
A Survey of Language-Based Communication in Robotics
William Hunt
Sarvapali D. Ramchurn
Mohammad D. Soorati
LM&Ro
711
17
0
06 Jun 2024
MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch
Cennet Oğuz
Vitor Fortes Rey
L. Ray
Maximilian Kiefer-Emmanouilidis
Paul Lukowicz
HAI
469
3
0
06 Jun 2024
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Mona Ahmadian
Frank Guerin
Andrew Gilbert
333
2
0
05 Jun 2024
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia
Jianyuan Guo
Kai Han
Han Wu
Chao Zhang
Chang Xu
Xinghao Chen
ViT
512
49
0
03 Jun 2024
WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization
Jiawei Ma
Yulei Niu
Shiyuan Huang
G. Han
Shih-Fu Chang
VLM
172
1
0
28 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
893
169
0
23 May 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
347
30
0
22 May 2024
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude
Prajwal K R
Liliane Momeni
Hannah Bull
Samuel Albanie
Andrew Zisserman
Gül Varol
SLR
326
8
0
16 May 2024
PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval
Jiancheng Pan
Muyuan Ma
Qing Ma
Cong Bai
Shengyong Chen
258
12
0
16 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
270
2
0
12 May 2024
Learning Object States from Actions via Large Language Models
Masatoshi Tateno
Takuma Yagi
Ryosuke Furuta
Yoichi Sato
136
2
0
02 May 2024
Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges
Badri N. Patro
Vijay Srinivas Agneeswaran
Mamba
362
76
0
24 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
404
4
0
22 Apr 2024
Towards a Foundation Model for Partial Differential Equations: Multi-Operator Learning and Extrapolation
Jingmin Sun
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
AI4CE
406
39
0
18 Apr 2024
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition
Marah Halawa
Florian Blume
Pia Bideau
Martin Maier
Rasha Abdel Rahman
Olaf Hellwich
CVBM
230
4
0
16 Apr 2024
Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis
Masahiro Yasuda
Noboru Harada
Yasunori Ohishi
Shoichiro Saito
Akira Nakayama
Nobutaka Ono
273
6
0
12 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
360
181
0
08 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
314
1
0
05 Apr 2024
Learning Correlation Structures for Vision Transformers
Manjin Kim
Paul Hongsuck Seo
Cordelia Schmid
Minsu Cho
ViT
298
25
0
05 Apr 2024
SUGAR: Pre-training 3D Visual Representations for Robotics
Computer Vision and Pattern Recognition (CVPR), 2024
Shizhe Chen
Ricardo Garcia Pinel
Ivan Laptev
Cordelia Schmid
258
33
0
01 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
151
0
0
29 Mar 2024
Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights
Moein Heidari
Reza Azad
Sina Ghorbani Kolahi
René Arimond
Leon Niggemeier
...
Afshin Bozorgpour
Ehsan Khodapanah Aghdam
Amirhossein Kazerouni
Ilker Hacihaliloglu
Dorit Merhof
304
14
0
28 Mar 2024
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Yash Jain
David M. Chan
Pranav Dheram
Aparna Khare
Olabanji Shonibare
Venkatesh Ravichandran
Shalini Ghosh
254
2
0
28 Mar 2024
Dense Vision Transformer Compression with Few Samples
Hanxiao Zhang
Yifan Zhou
Guo-Hua Wang
Jianxin Wu
ViT, VLM
230
10
0
27 Mar 2024
Generative Multi-modal Models are Good Class-Incremental Learners
Xusheng Cao
Haori Lu
Linlan Huang
Xialei Liu
Ming-Ming Cheng
CLL
314
26
0
27 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
European Conference on Computer Vision (ECCV), 2024
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
262
104
0
22 Mar 2024
Semantic-Enhanced Representation Learning for Road Networks with Temporal Dynamics
IEEE Transactions on Mobile Computing (IEEE TMC), 2024
Yile Chen
Xiucheng Li
Gao Cong
Zhifeng Bao
Cheng Long
195
7
0
18 Mar 2024
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
Guo Chen
Yifei Huang
Jilan Xu
Baoqi Pei
Zhe Chen
Zhiqi Li
Jiahao Wang
Kunchang Li
Tong Lu
Limin Wang
Mamba
279
126
0
14 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL, MoMe
361
18
0
13 Mar 2024
VideoMamba: State Space Model for Efficient Video Understanding
European Conference on Computer Vision (ECCV), 2024
Kunchang Li
Xinhao Li
Yi Wang
Yinan He
Yali Wang
Limin Wang
Yu Qiao
Mamba
284
390
0
11 Mar 2024
Materials science in the era of large language models: a perspective
Digital Discovery (DD), 2024
Ge Lei
Ronan Docherty
Samuel J. Cooper
230
41
0
11 Mar 2024
On the Generalization Ability of Unsupervised Pretraining
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Yuyang Deng
Junyuan Hong
Jiayu Zhou
M. Mahdavi
SSL
223
8
0
11 Mar 2024
CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
Mohammad Sadil Khan
Elona Dupont
Sk Aziz Ali
K. Cherenkova
Anis Kacem
Djamila Aouada
3DV, 3DPC
293
42
0
27 Feb 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
250
3
0
21 Feb 2024
LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs
Yunxin Li
Xinyu Chen
Baotian Hu
Min Zhang
265
9
0
21 Feb 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen, VLM
670
82
0
20 Feb 2024
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Long Qian
Juncheng Billy Li
Yu-hao Wu
Yaobo Ye
Hao Fei
Tat-Seng Chua
Yueting Zhuang
Siliang Tang
MLLM, LRM
370
100
0
18 Feb 2024
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE, VLM
345
177
0
15 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
263
5
0
14 Feb 2024