1904.01766
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps
Knowledge Discovery and Data Mining (KDD), 2022
Jizhou Huang
Haifeng Wang
Yibo Sun
Yunsheng Shi
Zhengjie Huang
An Zhuo
Shikun Feng
212
56
0
17 Mar 2022
Object discovery and representation networks
European Conference on Computer Vision (ECCV), 2022
Olivier J. Hénaff
Skanda Koppula
Evan Shelhamer
Daniel Zoran
Andrew Jaegle
Andrew Zisserman
João Carreira
Relja Arandjelović
425
95
0
16 Mar 2022
Geographic Adaptation of Pretrained Language Models
Transactions of the Association for Computational Linguistics (TACL), 2022
Valentin Hofmann
Goran Glavaš
Nikola Ljubešić
J. Pierrehumbert
Hinrich Schütze
VLM
391
21
0
16 Mar 2022
Modular and Parameter-Efficient Multimodal Fusion with Prompting
Findings of the Association for Computational Linguistics (Findings), 2022
Sheng Liang
Mengjie Zhao
Hinrich Schütze
166
51
0
15 Mar 2022
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Guanyu Cai
Yixiao Ge
Binjie Zhang
Alex Jinpeng Wang
Rui Yan
...
Ying Shan
Lianghua He
Xiaohu Qie
Jianping Wu
Mike Zheng Shou
VLM
194
6
0
15 Mar 2022
All in One: Exploring Unified Video-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Alex Jinpeng Wang
Yixiao Ge
Rui Yan
Yuying Ge
Xudong Lin
Guanyu Cai
Jianping Wu
Ying Shan
Xiaohu Qie
Mike Zheng Shou
316
237
0
14 Mar 2022
Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Bin Li
Yixuan Weng
Bin Sun
Shutao Li
717
67
0
13 Mar 2022
Cross-modal Map Learning for Vision and Language Navigation
Computer Vision and Pattern Recognition (CVPR), 2022
G. Georgakis
Karl Schmeckpeper
Karan Wanchoo
Soham Dan
E. Miltsakaki
Dan Roth
Kostas Daniilidis
390
99
0
10 Mar 2022
CaSS: A Channel-aware Self-supervised Representation Learning Framework for Multivariate Time Series Classification
International Conference on Database Systems for Advanced Applications (DASFAA), 2022
Yijiang Chen
Xiangdong Zhou
Zhen Xing
Zhidan Liu
Minyang Xu
AI4TS
SSL
166
6
0
08 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
212
41
0
03 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
230
43
0
02 Mar 2022
SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following
IEEE Robotics and Automation Letters (RA-L), 2022
Ruinian Xu
Hongyi Chen
Yunzhi Lin
Patricio A. Vela
171
7
0
25 Feb 2022
ISDA: Position-Aware Instance Segmentation with Deformable Attention
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Kaining Ying
Zhenhua Wang
Cong Bai
Pengfei Zhou
ISeg
228
7
0
23 Feb 2022
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
Computer Vision and Pattern Recognition (CVPR), 2022
Shixing Chen
Chundi Liu
Xiang Hao
Xiaohan Nie
Maxim Arap
Raffay Hamid
228
18
0
22 Feb 2022
Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor fusion
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Masahiro Yasuda
Yasunori Ohishi
Shoichiro Saito
Noboru Harada
155
21
0
18 Feb 2022
AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification
International Workshop on Semantic Evaluation (SemEval), 2022
Da Li
Ming Yi
Yukai He
144
2
0
18 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Machine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
396
289
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Amélie Reymond
217
3
0
16 Feb 2022
Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations
Youwei Liang
Chongjian Ge
Zhan Tong
Yibing Song
Jue Wang
P. Xie
ViT
369
347
0
16 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Knowledge Discovery and Data Mining (KDD), 2022
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
263
44
0
15 Feb 2022
UserBERT: Modeling Long- and Short-Term User Preferences via Self-Supervision
Tianyu Li
Ali Cevahir
Derek Cho
Hao Gong
Duy Nguyen
B. Stenger
SSL
86
1
0
14 Feb 2022
Learning To Recognize Procedural Activities with Distant Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
260
96
0
26 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
108
8
0
25 Jan 2022
Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan
Tao Xu
Raul Puri
Alec Radford
Jesse Michael Han
...
Tabarak Khan
Toki Sherbakov
Joanne Jang
Peter Welinder
Lilian Weng
SSL
AI4TS
610
538
0
24 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
300
185
0
20 Jan 2022
Video Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
458
139
0
16 Jan 2022
Boundary-aware Self-supervised Learning for Video Scene Segmentation
Asian Conference on Computer Vision (ACCV), 2022
Jonghwan Mun
Minchul Shin
Gunsoo Han
Sangho Lee
S. Ha
Joonseok Lee
Eun-Sol Kim
SSL
161
25
0
14 Jan 2022
Pretrained Language Models for Text Generation: A Survey
ACM Computing Surveys (ACM CSUR), 2022
Junyi Li
Tianyi Tang
Wayne Xin Zhao
J. Nie
Ji-Rong Wen
AI4CE
525
268
0
14 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Computer Vision and Pattern Recognition (CVPR), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
296
121
0
13 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
222
24
0
11 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
109
2
0
11 Jan 2022
Multi-Query Video Retrieval
European Conference on Computer Vision (ECCV), 2022
Zeyu Wang
Yu Wu
Karthik Narasimhan
Olga Russakovsky
291
23
0
10 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Computer Vision and Pattern Recognition (CVPR), 2022
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
514
239
0
07 Jan 2022
Progressive Video Summarization via Multimodal Self-supervised Learning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
AI4TS
336
34
0
07 Jan 2022
Discrete and continuous representations and processing in deep learning: Looking forward
AI Open (AO), 2022
Ruben Cartuyvels
Graham Spinks
Marie-Francine Moens
OCL
301
30
0
04 Jan 2022
InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer
Chin-Tung Lin
Mu Yang
ViT
174
3
0
31 Dec 2021
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation
International Conference on Image Processing (ICIP), 2021
Philipp Harzig
Moritz Einfalt
Rainer Lienhart
ViT
160
3
0
28 Dec 2021
A Survey of Natural Language Generation
ACM Computing Surveys (CSUR), 2021
Chenhe Dong
Hai-Tao Zheng
Haifan Gong
Mengzhao Chen
Junxin Li
Ying Shen
Min Yang
3DV
336
64
0
22 Dec 2021
Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Shengyu Feng
Subarna Tripathi
Hesham Mostafa
Marcel Nassar
Somdeb Majumdar
278
34
0
18 Dec 2021
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Computer Vision and Pattern Recognition (CVPR), 2021
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
Guosheng Lin
362
214
0
17 Dec 2021
Contrastive Vision-Language Pre-training with Limited Resources
Quan Cui
Boyan Zhou
Yu Guo
Weidong Yin
Hao Wu
Osamu Yoshie
Yubo Chen
VLM
CLIP
158
41
0
17 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
161
45
0
14 Dec 2021
Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition
Liangfei Zhang
Xiaopeng Hong
Ognjen Arandjelovic
Guoying Zhao
ViT
311
92
0
10 Dec 2021
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
Rui Qian
Yeqing Li
Liangzhe Yuan
Boqing Gong
Ting Liu
Matthew A. Brown
Serge Belongie
Ming-Hsuan Yang
Hartwig Adam
Huayu Chen
AI4TS
200
7
0
08 Dec 2021
Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning
Manlin Zhang
Jinpeng Wang
A. J. Ma
173
9
0
07 Dec 2021
Joint Learning of Localized Representations from Medical Images and Reports
European Conference on Computer Vision (ECCV), 2021
Philipp Muller
Georgios Kaissis
Cong Zou
Daniel Munich
440
113
0
06 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
251
152
0
02 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
261
27
0
02 Dec 2021
Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
Samuel Thomas
Alexander H. Liu
David Harwath
James R. Glass
Hilde Kuehne
M. Shah
SSL
138
5
0
01 Dec 2021
Object-aware Video-language Pre-training for Retrieval
Alex Jinpeng Wang
Yixiao Ge
Guanyu Cai
Rui Yan
Xudong Lin
Ying Shan
Xiaohu Qie
Mike Zheng Shou
ViT
VLM
286
86
0
01 Dec 2021