1904.01766
Cited By
v1
v2 (latest)
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
Learning grounded word meaning representations on similarity graphs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Mariella Dimiccoli
H. Wendt
Pau Batlle
155
1
0
07 Sep 2021
Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention
ACM Multimedia (ACM MM), 2021
Katsuyuki Nakamura
Hiroki Ohashi
Mitsuhiro Okada
EgoV
212
14
0
07 Sep 2021
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Tiezheng Yu
Wenliang Dai
Zihan Liu
Pascale Fung
297
79
0
06 Sep 2021
Audio-Visual Transformer Based Crowd Counting
Usman Sajid
Xiangyu Chen
Hasan Sajid
Taejoon Kim
Guanghui Wang
ViT
237
24
0
04 Sep 2021
Zero-shot Natural Language Video Localization
IEEE International Conference on Computer Vision (ICCV), 2021
Jinwoo Nam
Daechul Ahn
Luan Tuyen Chau
S. Ha
Jonghyun Choi
348
55
0
29 Aug 2021
Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers
Neural Information Processing Systems (NeurIPS), 2021
Nikita Dvornik
Isma Hadji
Konstantinos G. Derpanis
Animesh Garg
Allan D. Jepson
162
62
0
26 Aug 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
IEEE International Conference on Computer Vision (ICCV), 2021
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
226
154
0
23 Aug 2021
Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads
Xiaohu Jiang
Ze Chen
Zhicheng Wang
Erjin Zhou
Chun Yuan
121
2
0
22 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
232
37
0
20 Aug 2021
Investigating transformers in the decomposition of polygonal shapes as point collections
A. Alfieri
Yancong Lin
Jan van Gemert
ViT
3DPC
183
2
0
17 Aug 2021
Who's Waldo? Linking People Across Text and Images
Claire Yuqing Cui
Apoorv Khandelwal
Yoav Artzi
Noah Snavely
Hadar Averbuch-Elor
205
21
0
16 Aug 2021
Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)
ACM Multimedia (ACM MM), 2021
Yunzhong Hou
Liang Zheng
ViT
176
65
0
12 Aug 2021
Video Transformer for Deepfake Detection with Incremental Learning
ACM Multimedia (ACM MM), 2021
Sohail Ahmed Khan
Hang Dai
ViT
209
78
0
11 Aug 2021
Vision Transformer with Progressive Sampling
Xiaoyu Yue
Shuyang Sun
Zhanghui Kuang
Meng Wei
Juil Sock
Wayne Zhang
Dahua Lin
ViT
202
99
0
03 Aug 2021
Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021
Heng Zhao
Qiufeng Wang
Yew-Soon Ong
ObjD
194
33
0
31 Jul 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Information Fusion (Inf. Fusion), 2021
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
389
175
0
29 Jul 2021
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
ACM Computing Surveys (CSUR), 2021
Pengfei Liu
Weizhe Yuan
Jinlan Fu
Zhengbao Jiang
Hiroaki Hayashi
Graham Neubig
VLM
SyDa
775
4,857
0
28 Jul 2021
Predicting the Future from First Person (Egocentric) Vision: A Survey
Computer Vision and Image Understanding (CVIU), 2021
Ivan Rodin
Antonino Furnari
Dimitrios Mavroeidis
G. Farinella
EgoV
203
52
0
28 Jul 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
144
3
0
27 Jul 2021
LAORAM: A Look Ahead ORAM Architecture for Training Large Embedding Tables
International Symposium on Computer Architecture (ISCA), 2021
Rachit Rajat
Yongqin Wang
M. Annavaram
167
12
0
16 Jul 2021
BERT-like Pre-training for Symbolic Piano Music Classification Tasks
Yi-Hui Chou
I-Chun Chen
Chin-Jui Chang
Joann Ching
Yi-Hsuan Yang
272
28
0
12 Jul 2021
Local-to-Global Self-Attention in Vision Transformers
Jinpeng Li
Manwen Liao
Tianran Ouyang
Xiaokang Yang
Ling Shao
ViT
121
35
0
10 Jul 2021
Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers
International Conference on Learning Representations (ICLR), 2021
Ruihan Yang
Minghao Zhang
Nicklas Hansen
Huazhe Xu
Xiaolong Wang
OffRL
306
132
0
08 Jul 2021
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Zineng Tang
Jaemin Cho
Hao Tan
Joey Tianyi Zhou
VLM
194
34
0
06 Jul 2021
Test-Time Personalization with a Transformer for Human Pose Estimation
Yizhuo Li
Miao Hao
Zonglin Di
N. B. Gundavarapu
Xiaolong Wang
ViT
302
55
0
05 Jul 2021
Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition
Tomohiro Tanaka
Ryo Masumura
Mana Ihori
Akihiko Takashima
Takafumi Moriya
Takanori Ashihara
Shota Orihashi
Naoki Makishima
114
8
0
04 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Neural Information Processing Systems (NeurIPS), 2021
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
577
698
0
30 Jun 2021
A Generative Model for Raw Audio Using Transformer Architectures
International Conference on Digital Audio Effects (DAFx), 2021
Prateek Verma
C. Chafe
243
36
0
30 Jun 2021
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
Andrew Wang
Vasu Sharma
CML
137
5
0
25 Jun 2021
Towards Long-Form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Chao-Yuan Wu
Philipp Krahenbuhl
VLM
ViT
323
194
0
21 Jun 2021
End-to-end Temporal Action Detection with Transformer
IEEE Transactions on Image Processing (TIP), 2021
Xiaolong Liu
Qimeng Wang
Yao Hu
Xu Tang
Shiwei Zhang
S. Bai
X. Bai
ViT
306
292
0
18 Jun 2021
All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers
Carmelo Scribano
D. Sapienza
Giorgia Franchini
M. Verucchi
Marko Bertogna
134
6
0
18 Jun 2021
GEM: A General Evaluation Benchmark for Multimodal Tasks
Findings of the Association for Computational Linguistics (Findings), 2021
Lin Su
Nan Duan
Edward Cui
Lei Ji
Chenfei Wu
Huaishao Luo
Yongfei Liu
Ming Zhong
Taroon Bharti
Arun Sacheti
VLM
204
22
0
18 Jun 2021
Pre-Trained Models: Past, Present and Future
AI Open (AO), 2021
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin
MQ
AI4MH
385
990
0
14 Jun 2021
Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning
Shaobo Min
Jingdong Sun
Hongtao Xie
Chuang Gan
Yongdong Zhang
Jingdong Wang
SSL
150
6
0
13 Jun 2021
Transformed CNNs: recasting pre-trained convolutional layers with self-attention
Stéphane d'Ascoli
Levent Sagun
Giulio Biroli
Ari S. Morcos
ViT
98
7
0
10 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Neural Information Processing Systems (NeurIPS), 2021
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
283
340
0
09 Jun 2021
Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time
Computer Vision and Pattern Recognition (CVPR), 2021
Shao-Wei Liu
Hanwen Jiang
Jiarui Xu
Sifei Liu
Xiaolong Wang
3DH
231
192
0
09 Jun 2021
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Linjie Li
Jie Lei
Zhe Gan
Licheng Yu
Yen-Chun Chen
...
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
265
117
0
08 Jun 2021
A Survey of Transformers
AI Open (AO), 2021
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
445
1,386
0
08 Jun 2021
Efficient Training of Visual Transformers with Small Datasets
Neural Information Processing Systems (NeurIPS), 2021
Yahui Liu
E. Sangineto
Wei Bi
Andrii Zadaianchuk
Bruno Lepri
Marco De Nadai
ViT
180
213
0
07 Jun 2021
BERTGEN: Multi-task Generation through BERT
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Faidon Mitzalis
Ozan Caglayan
Pranava Madhyastha
Lucia Specia
VLM
111
7
0
07 Jun 2021
Transformed ROIs for Capturing Visual Transformations in Videos
Computer Vision and Image Understanding (CVIU), 2021
Abhinav Rai
Fadime Sener
Angela Yao
ViT
230
4
0
06 Jun 2021
Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Fadime Sener
Rishabh Saraf
Angela Yao
LM&Ro
183
17
0
06 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Neural Information Processing Systems (NeurIPS), 2021
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM
LRM
348
428
0
04 Jun 2021
Anticipative Video Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Rohit Girdhar
Kristen Grauman
ViT
335
251
0
03 Jun 2021
TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data
Pengda Qin
Yuhong Li
Kefeng Deng
Qiang Wu
120
1
0
03 Jun 2021
Attention mechanisms and deep learning for machine vision: A survey of the state of the art
A. M. Hafiz
S. A. Parah
R. A. Bhat
227
56
0
03 Jun 2021
Connecting Language and Vision for Natural Language-Based Vehicle Retrieval
Shuai Bai
Zhedong Zheng
Xiaohan Wang
Junyang Lin
Zhu Zhang
Chang Zhou
Yi Yang
Hongxia Yang
227
30
0
31 May 2021
Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing
Jianning Wu
Zhuqing Jiang
S. Wen
Aidong Men
Haiying Wang
223
1
0
30 May 2021