VideoBERT: A Joint Model for Video and Language Representation Learning

3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM SSL

Papers citing "VideoBERT: A Joint Model for Video and Language Representation Learning"

50 / 803 papers shown
NER-BERT: A Pre-trained Model for Low-Resource Entity Tagging
Zihan Liu
Feijun Jiang
Yuxiang Hu
Chen Shi
Pascale Fung
304
43
0
01 Dec 2021
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang-ju Yang
Tong Zhang
Yuexian Zou
CLIP
143
26
0
30 Nov 2021
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Computer Vision and Pattern Recognition (CVPR), 2021
Aiham Taleb
Matthias Kirchler
Remo Monti
Christoph Lippert
SSL MedIm
609
69
0
26 Nov 2021
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Faisal Ahmed
Zhe Gan
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
351
303
0
25 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
405
240
0
24 Nov 2021
Hierarchical Modular Network for Video Captioning
Hanhua Ye
Guorong Li
Yuankai Qi
Shuhui Wang
Qingming Huang
Ming-Hsuan Yang
230
90
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM VLM
423
300
0
24 Nov 2021
Multi-Person 3D Motion Prediction with Multi-Range Transformers
Jiashun Wang
Huazhe Xu
Medhini Narasimhan
Xiaolong Wang
ViT
252
92
0
23 Nov 2021
Towards Tokenized Human Dynamics Representation
Kenneth Li
Xiao Sun
Zhirong Wu
Fangyun Wei
Stephen Lin
219
3
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
European Conference on Computer Vision (ECCV), 2021
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
623
116
0
22 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Computer Vision and Pattern Recognition (CVPR), 2021
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS VLM
253
253
0
19 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
260
4
0
19 Nov 2021
A Survey of Visual Transformers
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Peng Wang
Jianping Fan
Zhiqiang He
3DGS ViT
473
487
0
11 Nov 2021
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
Zijian Gao
Qingbin Liu
Weiqi Sun
S. Chen
Dedan Chang
Lili Zhao
VLM CLIP
132
26
0
10 Nov 2021
Machine Learning for Multimodal Electronic Health Records-based Research: Challenges and Perspectives
Ziyi Liu
Jiaqi Zhang
Yongshuai Hou
Xinran Zhang
Ge Li
Yang Xiang
297
18
0
09 Nov 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn
Y. Siu
Aditya Bodi
Lothar D Narins
Jose M. Castanon
Yash Kant
Abhishek Das
Ilmi Yoon
Pooyan Fazli
110
6
0
07 Nov 2021
Benchmarking Multimodal AutoML for Tabular Data with Text Fields
Xingjian Shi
Jonas W. Mueller
Nick Erickson
Mu Li
Alexander J. Smola
LMTD
155
39
0
04 Nov 2021
Revisiting spatio-temporal layouts for compositional action recognition
British Machine Vision Conference (BMVC), 2021
Gorjan Radevski
Marie-Francine Moens
Tinne Tuytelaars
212
30
0
02 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
298
31
0
01 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
British Machine Vision Conference (BMVC), 2021
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
297
54
0
01 Nov 2021
Cross-Modality Fusion Transformer for Multispectral Object Detection
Social Science Research Network (SSRN), 2021
Q. Fang
D. Han
Zhaokui Wang
ViT
301
269
0
30 Oct 2021
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
Jinming Zhao
Ruichen Li
Qin Jin
Xinchao Wang
Haizhou Li
145
38
0
27 Oct 2021
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Shraman Pramanick
A. Roy
Vishal M. Patel
205
83
0
21 Oct 2021
Toward Accurate and Reliable Iris Segmentation Using Uncertainty Learning
Jianze Wei
Huaibo Huang
Muyi Sun
Yunlong Wang
Min Ren
Ran He
Zhenan Sun
165
8
0
20 Oct 2021
Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention
Zhe Zhou
Junling Liu
Zhenyu Gu
Guangyu Sun
268
62
0
18 Oct 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu
Alexander Spangher
Pegah Alipoormolabashi
Marjorie Freedman
R. Weischedel
Nanyun Peng
278
28
0
16 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
326
18
0
14 Oct 2021
A CLIP-Enhanced Method for Video-Language Understanding
Guohao Li
Feng He
Zhifan Feng
CLIP
127
12
0
14 Oct 2021
Multi-Modal Pre-Training for Automated Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
David M. Chan
Shalini Ghosh
D. Chakrabarty
Björn Hoffmeister
SSL
234
16
0
12 Oct 2021
Vit-GAN: Image-to-image Translation with Vision Transformes and Conditional GANS
Yigit Gündüç
ViT
100
3
0
11 Oct 2021
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
IEEE International Conference on Computer Vision (ICCV), 2021
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Yuechen Wang
Houqiang Li
ViT
263
109
0
11 Oct 2021
Pretrained Language Models are Symbolic Mathematics Solvers too!
Kimia Noorbakhsh
Modar Sulaiman
M. Sharifi
Kallol Roy
Pooyan Jamshidi
LRM
292
22
0
07 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....
Prateek Verma
AI4TS
281
3
0
07 Oct 2021
Tensor-to-Image: Image-to-Image Translation with Vision Transformers
Y. Gündüç
ViT
94
6
0
06 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT LM&Ro
260
32
0
02 Oct 2021
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations
Mohammadreza Zolfaghari
Yi Zhu
Peter V. Gehler
Thomas Brox
332
148
0
30 Sep 2021
IntentVizor: Towards Generic Query Guided Interactive Video Summarization
Guande Wu
Jianzhe Lin
Claudio T. Silva
230
35
0
30 Sep 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP VLM
830
694
0
28 Sep 2021
Audio-to-Image Cross-Modal Generation
IEEE International Joint Conference on Neural Network (IJCNN), 2021
Maciej Żelaszczyk
Jacek Mańdziuk
DiffM
202
19
0
27 Sep 2021
Self-Supervised Video Representation Learning by Video Incoherence Detection
IEEE Transactions on Cybernetics (IEEE Trans. Cybern.), 2021
Haozhi Cao
Yuecong Xu
Jianfei Yang
K. Mao
Lihua Xie
Jianxiong Yin
Simon See
SSL
121
8
0
26 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM VPVLM VLM
589
244
0
24 Sep 2021
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
ACM Multimedia (ACM MM), 2021
Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai
136
9
0
23 Sep 2021
Does Vision-and-Language Pretraining Improve Lexical Grounding?
Tian Yun
Chen Sun
Ellie Pavlick
VLM CoGe
236
36
0
21 Sep 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM ViT
210
50
0
21 Sep 2021
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Zhenzhi Wang
Liyu Wu
Zhimin Li
Jiangfeng Xiong
Qinglin Lu
147
5
0
16 Sep 2021
Cross-lingual Transfer of Monolingual Models
Evangelia Gogoulou
Ariel Ekgren
T. Isbister
Magnus Sahlgren
256
20
0
15 Sep 2021
Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color
Mostafa Abdou
Artur Kulmizev
Daniel Hershcovich
Stella Frank
Ellie Pavlick
Anders Søgaard
215
159
0
13 Sep 2021
A Survey on Multi-modal Summarization
Anubhav Jangra
Sourajit Mukherjee
Adam Jatowt
S. Saha
M. Hasanuzzaman
206
79
0
11 Sep 2021
PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks
IEEE Robotics and Automation Letters (RA-L), 2021
Jiankai Sun
De-An Huang
Bo Lu
Yunhui Liu
Bolei Zhou
Animesh Garg
180
62
0
10 Sep 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Computer Vision and Pattern Recognition (CVPR), 2021
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
586
46
0
09 Sep 2021