ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1904.01766
  4. Cited By
VideoBERT: A Joint Model for Video and Language Representation Learning
v1v2 (latest)

VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLMSSL
ArXiv (abs)PDFHTML

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based
  Recommendation
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
Xiaohan Yu
Li Zhang
Xin Zhao
Yue Wang
Zhongrui Ma
161
14
0
07 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
  Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
254
5
0
31 Jan 2024
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijiao Zhang
Jindong Han
Zhao Xu
Hang Ni
Hao Liu
Hui Xiong
Hui Xiong
AI4CE
554
26
0
30 Jan 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
238
0
0
29 Jan 2024
Dynamic Transformer Architecture for Continual Learning of Multimodal
  Tasks
Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks
Yuliang Cai
Mohammad Rostami
357
6
0
27 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other
  Modalities
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other ModalitiesComputer Vision and Pattern Recognition (CVPR), 2024
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
312
11
0
25 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal ProcessingIEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2024
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
235
1
0
22 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
296
42
0
19 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Collaboratively Self-supervised Video Representation Learning for Action RecognitionIEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
376
2
0
15 Jan 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
712
163
0
29 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Data-Efficient Multimodal Fusion on a Single GPUComputer Vision and Pattern Recognition (CVPR), 2023
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
Anthony L. Caterini
460
8
0
15 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLMMLLM
240
66
0
11 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
  Captions for Better Long Video Retrieval
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video RetrievalIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
288
1
1
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
496
2
0
30 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
  Semantic Vector-Quantized Tokenizer
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
392
1
0
28 Nov 2023
Mitigating Object Hallucinations in Large Vision-Language Models through
  Visual Contrastive Decoding
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive DecodingComputer Vision and Pattern Recognition (CVPR), 2023
Sicong Leng
Hang Zhang
Guanzheng Chen
Xin Li
Shijian Lu
Chunyan Miao
Li Bing
VLMMLLM
311
442
0
28 Nov 2023
Vamos: Versatile Action Models for Video Understanding
Vamos: Versatile Action Models for Video UnderstandingEuropean Conference on Computer Vision (ECCV), 2023
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
389
35
0
22 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
443
14
0
21 Nov 2023
Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal Learning
Advancing Drug Discovery with Enhanced Chemical Understanding via Asymmetric Contrastive Multimodal LearningJournal of Chemical Information and Modeling (JCIM), 2023
Hao Xu
Yifei Wang
Yunrui Li
Pengyu Hong
Pengyu Hong
371
1
0
11 Nov 2023
Towards A Unified Neural Architecture for Visual Recognition and
  Reasoning
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Calvin Luo
Boqing Gong
Ting Chen
Chen Sun
OCLObjD
163
1
0
10 Nov 2023
CLearViD: Curriculum Learning for Video Description
CLearViD: Curriculum Learning for Video Description
Cheng-Yu Chuang
Pooyan Fazli
152
1
0
08 Nov 2023
A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose
  Estimation
A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose EstimationNeural Information Processing Systems (NeurIPS), 2023
Qi-jun Zhao
Ce Zheng
Mengyuan Liu
Chong Chen
231
24
0
06 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life
  Videos
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life VideosConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
282
9
0
02 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Object-centric Video Representation for Long-term Action AnticipationIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
279
29
0
31 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIPVLMVGen
350
3
0
30 Oct 2023
Generating Context-Aware Natural Answers for Questions in 3D Scenes
Generating Context-Aware Natural Answers for Questions in 3D ScenesBritish Machine Vision Conference (BMVC), 2023
Mohammed Munzer Dwedari
Matthias Niessner
Dave Zhenyu Chen
194
4
0
30 Oct 2023
MOSEL: Inference Serving Using Dynamic Modality Selection
MOSEL: Inference Serving Using Dynamic Modality SelectionConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Bodun Hu
Le Xu
Jeongyoon Moon
N. Yadwadkar
Aditya Akella
300
5
0
27 Oct 2023
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural
  Languages
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural LanguagesConference on Computational Natural Language Learning (CoNLL), 2023
Mohammad Akbari
Saeed Ranjbar Alvar
Behnam Kamranian
Amin Banitalebi-Dehkordi
Yong Zhang
AI4CE
139
2
0
26 Oct 2023
Exploring Iterative Refinement with Diffusion Models for Video Grounding
Exploring Iterative Refinement with Diffusion Models for Video GroundingIEEE International Conference on Multimedia and Expo (ICME), 2023
Xiao Liang
Tao Shi
Yaoyuan Liang
Te Tao
Shao-Lun Huang
DiffM
267
2
0
26 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
CAD -- Contextual Multi-modal Alignment for Dynamic AVQAIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
302
14
0
25 Oct 2023
FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal
  Consistency and Correlation Debiasing
FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing
Anant Khandelwal
459
2
0
24 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of
  Multi-modal Large Models
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large ModelsIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yanyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
389
13
0
17 Oct 2023
GePSAn: Generative Procedure Step Anticipation in Cooking Videos
GePSAn: Generative Procedure Step Anticipation in Cooking VideosIEEE International Conference on Computer Vision (ICCV), 2023
M. A. Abdelsalam
Samrudhdhi B. Rangrej
Isma Hadji
Nikita Dvornik
Konstantinos G. Derpanis
Afsaneh Fazly
AI4TS
217
8
0
12 Oct 2023
Latent Wander: an Alternative Interface for Interactive and
  Serendipitous Discovery of Large AV Archives
Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives
Yuchen Yang
Linyida Zhang
219
2
0
09 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
HowToCaption: Prompting LLMs to Transform Video Annotations at ScaleEuropean Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
297
31
0
07 Oct 2023
CLEVRER-Humans: Describing Physical and Causal Events the Human Way
CLEVRER-Humans: Describing Physical and Causal Events the Human WayNeural Information Processing Systems (NeurIPS), 2023
Jiayuan Mao
Xuelin Yang
Xikun Zhang
Noah D. Goodman
Jiajun Wu
NAI
333
22
0
05 Oct 2023
GRID: A Platform for General Robot Intelligence Development
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
271
11
0
02 Oct 2023
Skip-Plan: Procedure Planning in Instructional Videos via Condensed
  Action Space Learning
Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space LearningIEEE International Conference on Computer Vision (ICCV), 2023
Zhiheng Li
Wenjia Geng
Muheng Li
Lei Chen
Yansong Tang
Jiwen Lu
Jie Zhou
178
12
0
01 Oct 2023
PROSE: Predicting Operators and Symbolic Expressions using Multimodal
  Transformers
PROSE: Predicting Operators and Symbolic Expressions using Multimodal Transformers
Yuxuan Liu
Zecheng Zhang
Hayden Schaeffer
211
22
0
28 Sep 2023
Social Media Fashion Knowledge Extraction as Captioning
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
180
2
0
28 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
238
5
0
28 Sep 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLMVLM
253
8
0
27 Sep 2023
VidChapters-7M: Video Chapters at Scale
VidChapters-7M: Video Chapters at ScaleNeural Information Processing Systems (NeurIPS), 2023
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
246
39
0
25 Sep 2023
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for
  Long-form Video Understanding
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
Mohamed Afham
Satya Narayan Shukla
Omid Poursaeed
Pengchuan Zhang
Ashish Shah
Sernam Lim
VLM
195
4
0
20 Sep 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive LearningACM Multimedia (ACM MM), 2023
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming-Hsuan Yang
Yuan Qi
366
17
0
20 Sep 2023
Collaborative Three-Stream Transformers for Video Captioning
Collaborative Three-Stream Transformers for Video CaptioningComputer Vision and Image Understanding (CVIU), 2023
Hao Wang
Libo Zhang
Hengrui Fan
Tiejian Luo
193
8
0
18 Sep 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
  Transfer Learning
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer LearningIEEE International Conference on Computer Vision (ICCV), 2023
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
216
31
0
14 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person
  Re-identification
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identificationIEEE International Conference on Computer Vision (ICCV), 2023
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
239
38
0
04 Sep 2023
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
  Spotting using Transformers
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
J. Denize
Mykola Liashuha
Jaonary Rabarisoa
Astrid Orcesi
Romain Hérault
ViT
289
19
0
03 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint
  Objectives
A Fine-Grained Image Description Generation Method Based on Joint ObjectivesChinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
123
0
0
02 Sep 2023
Previous
12345...151617
Next