1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
NER-BERT: A Pre-trained Model for Low-Resource Entity Tagging
Zihan Liu
Feijun Jiang
Yuxiang Hu
Chen Shi
Pascale Fung
304
43
0
01 Dec 2021
CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
Bang-ju Yang
Tong Zhang
Yuexian Zou
CLIP
143
26
0
30 Nov 2021
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Computer Vision and Pattern Recognition (CVPR), 2021
Aiham Taleb
Matthias Kirchler
Remo Monti
Christoph Lippert
SSL
MedIm
609
69
0
26 Nov 2021
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Faisal Ahmed
Zhe Gan
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
351
303
0
25 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
405
240
0
24 Nov 2021
Hierarchical Modular Network for Video Captioning
Hanhua Ye
Guorong Li
Yuankai Qi
Shuhui Wang
Qingming Huang
Ming-Hsuan Yang
230
90
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
423
300
0
24 Nov 2021
Multi-Person 3D Motion Prediction with Multi-Range Transformers
Jiashun Wang
Huazhe Xu
Medhini Narasimhan
Xiaolong Wang
ViT
252
92
0
23 Nov 2021
Towards Tokenized Human Dynamics Representation
Kenneth Li
Xiao Sun
Zhirong Wu
Fangyun Wei
Stephen Lin
219
3
0
22 Nov 2021
Class-agnostic Object Detection with Multi-modal Transformer
European Conference on Computer Vision (ECCV), 2021
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
Rao Muhammad Anwer
Ming-Hsuan Yang
623
116
0
22 Nov 2021
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Computer Vision and Pattern Recognition (CVPR), 2021
Hongwei Xue
Tiankai Hang
Yanhong Zeng
Yuchong Sun
Bei Liu
Huan Yang
Jianlong Fu
B. Guo
AI4TS
VLM
253
253
0
19 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
260
4
0
19 Nov 2021
A Survey of Visual Transformers
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Peng Wang
Jianping Fan
Zhiqiang He
3DGS
ViT
473
487
0
11 Nov 2021
CLIP2TV: Align, Match and Distill for Video-Text Retrieval
Zijian Gao
Qingbin Liu
Weiqi Sun
S. Chen
Dedan Chang
Lili Zhao
VLM
CLIP
132
26
0
10 Nov 2021
Machine Learning for Multimodal Electronic Health Records-based Research: Challenges and Perspectives
Ziyi Liu
Jiaqi Zhang
Yongshuai Hou
Xinran Zhang
Ge Li
Yang Xiang
297
18
0
09 Nov 2021
NarrationBot and InfoBot: A Hybrid System for Automated Video Description
Shasta Ihorn
Y. Siu
Aditya Bodi
Lothar D Narins
Jose M. Castanon
Yash Kant
Abhishek Das
Ilmi Yoon
Pooyan Fazli
110
6
0
07 Nov 2021
Benchmarking Multimodal AutoML for Tabular Data with Text Fields
Xingjian Shi
Jonas W. Mueller
Nick Erickson
Mu Li
Alexander J. Smola
LMTD
155
39
0
04 Nov 2021
Revisiting spatio-temporal layouts for compositional action recognition
British Machine Vision Conference (BMVC), 2021
Gorjan Radevski
Marie-Francine Moens
Tinne Tuytelaars
212
30
0
02 Nov 2021
Masking Modalities for Cross-modal Video Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Valentin Gabeur
Arsha Nagrani
Chen Sun
Alahari Karteek
Cordelia Schmid
298
31
0
01 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
British Machine Vision Conference (BMVC), 2021
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
297
54
0
01 Nov 2021
Cross-Modality Fusion Transformer for Multispectral Object Detection
Social Science Research Network (SSRN), 2021
Q. Fang
D. Han
Zhaokui Wang
ViT
301
269
0
30 Oct 2021
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
Jinming Zhao
Ruichen Li
Qin Jin
Xinchao Wang
Haizhou Li
145
38
0
27 Oct 2021
Multimodal Learning using Optimal Transport for Sarcasm and Humor Detection
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Shraman Pramanick
A. Roy
Vishal M. Patel
205
83
0
21 Oct 2021
Toward Accurate and Reliable Iris Segmentation Using Uncertainty Learning
Jianze Wei
Huaibo Huang
Muyi Sun
Yunlong Wang
Min Ren
Ran He
Zhenan Sun
165
8
0
20 Oct 2021
Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention
Zhe Zhou
Junling Liu
Zhenyu Gu
Guangyu Sun
268
62
0
18 Oct 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu
Alexander Spangher
Pegah Alipoormolabashi
Marjorie Freedman
R. Weischedel
Nanyun Peng
278
28
0
16 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
326
18
0
14 Oct 2021
A CLIP-Enhanced Method for Video-Language Understanding
Guohao Li
Feng He
Zhifan Feng
CLIP
127
12
0
14 Oct 2021
Multi-Modal Pre-Training for Automated Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
David M. Chan
Shalini Ghosh
D. Chakrabarty
Björn Hoffmeister
SSL
234
16
0
12 Oct 2021
Vit-GAN: Image-to-image Translation with Vision Transformes and Conditional GANS
Yigit Gündüç
ViT
100
3
0
11 Oct 2021
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
IEEE International Conference on Computer Vision (ICCV), 2021
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Yuechen Wang
Houqiang Li
ViT
263
109
0
11 Oct 2021
Pretrained Language Models are Symbolic Mathematics Solvers too!
Kimia Noorbakhsh
Modar Sulaiman
M. Sharifi
Kallol Roy
Pooyan Jamshidi
LRM
292
22
0
07 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....
Prateek Verma
AI4TS
281
3
0
07 Oct 2021
Tensor-to-Image: Image-to-Image Translation with Vision Transformers
Y. Gündüç
ViT
94
6
0
06 Oct 2021
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao
Karan Samel
Binghong Chen
Le Song
ViT
LM&Ro
260
32
0
02 Oct 2021
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations
Mohammadreza Zolfaghari
Yi Zhu
Peter V. Gehler
Thomas Brox
332
148
0
30 Sep 2021
IntentVizor: Towards Generic Query Guided Interactive Video Summarization
Guande Wu
Jianzhe Lin
Claudio T. Silva
230
35
0
30 Sep 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
830
694
0
28 Sep 2021
Audio-to-Image Cross-Modal Generation
IEEE International Joint Conference on Neural Network (IJCNN), 2021
Maciej Żelaszczyk
Jacek Mańdziuk
DiffM
202
19
0
27 Sep 2021
Self-Supervised Video Representation Learning by Video Incoherence Detection
IEEE Transactions on Cybernetics (IEEE Trans. Cybern.), 2021
Haozhi Cao
Yuecong Xu
Jianfei Yang
K. Mao
Lihua Xie
Jianxiong Yin
Simon See
SSL
121
8
0
26 Sep 2021
CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
Yuan Yao
Ao Zhang
Zhengyan Zhang
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VPVLM
VLM
589
244
0
24 Sep 2021
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
ACM Multimedia (ACM MM), 2021
Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai
136
9
0
23 Sep 2021
Does Vision-and-Language Pretraining Improve Lexical Grounding?
Tian Yun
Chen Sun
Ellie Pavlick
VLM
CoGe
236
36
0
21 Sep 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
210
50
0
21 Sep 2021
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Zhenzhi Wang
Liyu Wu
Zhimin Li
Jiangfeng Xiong
Qinglin Lu
147
5
0
16 Sep 2021
Cross-lingual Transfer of Monolingual Models
Evangelia Gogoulou
Ariel Ekgren
T. Isbister
Magnus Sahlgren
256
20
0
15 Sep 2021
Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color
Mostafa Abdou
Artur Kulmizev
Daniel Hershcovich
Stella Frank
Ellie Pavlick
Anders Søgaard
215
159
0
13 Sep 2021
A Survey on Multi-modal Summarization
Anubhav Jangra
Sourajit Mukherjee
Adam Jatowt
S. Saha
M. Hasanuzzaman
206
79
0
11 Sep 2021
PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks
IEEE Robotics and Automation Letters (RA-L), 2021
Jiankai Sun
De-An Huang
Bo Lu
Yunhui Liu
Bolei Zhou
Animesh Garg
180
62
0
10 Sep 2021
M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining
Computer Vision and Pattern Recognition (CVPR), 2021
Xiao Dong
Xunlin Zhan
Yangxin Wu
Yunchao Wei
Michael C. Kampffmeyer
Xiaoyong Wei
Minlong Lu
Yaowei Wang
Xiaodan Liang
586
46
0
09 Sep 2021