VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
    VLM, SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
MGeo: Multi-Modal Geographic Pre-Training Method
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Ruixue Ding
Boli Chen
Pengjun Xie
Fei Huang
Xin Li
Qiang-Wei Zhang
Yao Xu
273
29
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zhuosheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
291
30
0
09 Jan 2023
MAQA: A Multimodal QA Benchmark for Negation
Judith Yue Li
Aren Jansen
Qingqing Huang
Joonseok Lee
Ravi Ganti
Dima Kuzmin
216
7
0
09 Jan 2023
Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture
P. Verschuuren
Jie Gao
A. V. Eeden
Stylianos Oikonomou
Anil Bandhakavi
269
2
0
09 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Computer Vision and Pattern Recognition (CVPR), 2023
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
465
47
0
05 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
IEEE International Conference on Computer Vision (ICCV), 2023
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
215
7
0
05 Jan 2023
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Computer Vision and Pattern Recognition (CVPR), 2023
Santhosh Kumar Ramakrishnan
Ziad Al-Halah
Kristen Grauman
372
47
0
02 Jan 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM, AI4TS
547
91
0
30 Dec 2022
Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals
Affective Computing and Intelligent Interaction (ACII), 2022
Juan Vazquez-Rodriguez
G. Lefebvre
Julien Cumin
James L. Crowley
195
15
0
22 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
290
92
0
09 Dec 2022
Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation
IEEE Access (IEEE Access), 2022
Jie Jiang
Zhimin Li
Jiangfeng Xiong
Rongwei Quan
Qinglin Lu
Wei Liu
199
3
0
09 Dec 2022
Learning Video Representations from Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM, AI4TS
307
231
0
08 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
169
30
0
07 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM, VGen
466
448
0
06 Dec 2022
Muscles in Action
IEEE International Conference on Computer Vision (ICCV), 2022
Mia Chiquier
Carl Vondrick
319
1
0
05 Dec 2022
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
International Journal of Computer Vision (IJCV), 2022
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
303
2
0
05 Dec 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
190
13
0
02 Dec 2022
Protein Language Models and Structure Prediction: Connection and Progression
Bozhen Hu
Jun Xia
Jiangbin Zheng
Cheng Tan
Yufei Huang
Yongjie Xu
Stan Z. Li
220
46
0
30 Nov 2022
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
International Conference on Machine Learning (ICML), 2022
Huaishao Luo
Junwei Bao
Youzheng Wu
Xiaodong He
Tianrui Li
VLM
248
196
0
27 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
179
10
0
21 Nov 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Neural Information Processing Systems (NeurIPS), 2022
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David Clifton
Jing Chen
VLM
307
87
0
21 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
245
56
0
17 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Neural Information Processing Systems (NeurIPS), 2022
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
266
128
0
17 Nov 2022
Cross-Modal Adapter for Vision-Language Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Gao Huang
368
43
0
17 Nov 2022
Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Daniel Fried
Nicholas Tomlin
Jennifer Hu
Roma Patel
Aida Nematzadeh
250
9
0
15 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
248
6
0
14 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
ACM Multimedia (ACM MM), 2022
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
178
2
0
07 Nov 2022
CASA: Category-agnostic Skeletal Animal Reconstruction
Neural Information Processing Systems (NeurIPS), 2022
Yuefan Wu
Ze-Yin Chen
Shao-Wei Liu
Zhongzheng Ren
Shenlong Wang
261
41
0
04 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Neural Information Processing Systems (NeurIPS), 2022
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zinan Lin
Cong Yu
153
14
0
03 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
209
13
0
28 Oct 2022
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
213
7
0
26 Oct 2022
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuechen Wang
Wen-gang Zhou
Houqiang Li
AI4TS
155
14
0
21 Oct 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
251
19
0
21 Oct 2022
H4VDM: H.264 Video Device Matching
Ziyue Xiang
Paolo Bestagini
Stefano Tubaro
Edward J. Delp
116
1
0
20 Oct 2022
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yu Zhao
Jianguo Wei
Zhichao Lin
Yueheng Sun
Meishan Zhang
Hao Fei
193
17
0
20 Oct 2022
Grounded Video Situation Recognition
Neural Information Processing Systems (NeurIPS), 2022
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
192
16
0
19 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
European Conference on Computer Vision (ECCV), 2022
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
245
8
0
19 Oct 2022
Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy
Shiyuan Huang
Robinson Piramuthu
Shih-Fu Chang
Gunnar Sigurdsson
202
1
0
15 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLM, VLM
313
76
0
14 Oct 2022
Can Language Representation Models Think in Bets?
Royal Society Open Science (RSOS), 2022
Zhi–Bin Tang
Mayank Kejriwal
159
7
0
14 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
162
10
0
13 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Neural Information Processing Systems (NeurIPS), 2022
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS, VLM
297
84
0
12 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
233
15
0
10 Oct 2022
Generating Executable Action Plans with Environmentally-Aware Language Models
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Maitrey Gramopadhye
D. Szafir
LM&Ro, LLMAG
325
38
0
10 Oct 2022
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Asian Conference on Computer Vision (ACCV), 2022
A. Fragomeni
Michael Wray
Dima Damen
CLIP, ViT
158
4
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
British Machine Vision Conference (BMVC), 2022
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
312
2
0
08 Oct 2022
See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
Maria Attarian
Advaya Gupta
Ziyi Zhou
Wei Yu
Igor Gilitschenski
Animesh Garg
LM&Ro
213
8
0
07 Oct 2022
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Findings (Findings), 2022
Wanrong Zhu
An Yan
Yujie Lu
Wenda Xu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
324
38
0
07 Oct 2022
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
Ke Shen
Mayank Kejriwal
192
4
0
03 Oct 2022
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Tianyu Huang
Bowen Dong
Yunhan Yang
Xiaoshui Huang
Rynson W. H. Lau
Wanli Ouyang
W. Zuo
VLM, 3DPC, CLIP
580
199
0
03 Oct 2022