2103.15691
ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
ArXiv (abs)
PDF
HTML
HuggingFace (3 upvotes)
Github (3544★)
Papers citing
"ViViT: A Video Vision Transformer"
50 / 1,311 papers shown
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei
Haoqi Fan
Saining Xie
Chaoxia Wu
Alan Yuille
Christoph Feichtenhofer
ViT
522
792
0
16 Dec 2021
SeqFormer: Sequential Transformer for Video Instance Segmentation
Junfeng Wu
Yi Jiang
S. Bai
Wenqing Zhang
Xiang Bai
ViT
223
133
0
15 Dec 2021
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos
Pengfei Pei
Xianfeng Zhao
Yun Cao
Jinchuan Li
Xiaowei Yi
ViT
287
9
0
15 Dec 2021
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
168
63
0
14 Dec 2021
Translating Human Mobility Forecasting through Natural Language Generation
Hao Xue
Flora D. Salim
Yongli Ren
C. Clarke
AI4TS
127
25
0
13 Dec 2021
Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer
Yunhai Han
Kelin Yu
Rahul Batra
Nathan Boyd
Chaitanya Mehta
T. Zhao
Y. She
S. Hutchinson
Ye Zhao
ViT
514
71
0
13 Dec 2021
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang
Pan Zhou
Roger Zimmermann
Shuicheng Yan
ViT
187
25
0
09 Dec 2021
MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
Rex Liu
Huan Zhang
Hamed Pirsiavash
Xin Liu
ViT
298
17
0
08 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
369
863
0
08 Dec 2021
Prompting Visual-Language Models for Efficient Video Understanding
Chen Ju
Tengda Han
Kunhao Zheng
Ya Zhang
Weidi Xie
VPVLM
VLM
374
460
0
08 Dec 2021
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai
Srijan Das
Kumara Kahatapitiya
Michael S. Ryoo
Francois Bremond
ViT
299
96
0
07 Dec 2021
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Computer Vision and Pattern Recognition (CVPR), 2021
Zhao Yang
Yuan Liu
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Juil Sock
883
424
0
04 Dec 2021
BEVT: BERT Pretraining of Video Transformers
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
285
248
0
02 Dec 2021
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
492
842
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
251
152
0
02 Dec 2021
Self-supervised Video Transformer
Kanchana Ranasinghe
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
Michael S. Ryoo
ViT
332
109
0
02 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
250
27
0
02 Dec 2021
Multi-domain Integrative Swin Transformer network for Sparse-View Tomographic Reconstruction
Jiayi Pan
Heye Zhang
Weifei Wu
Zijian Gao
Weiwen Wu
364
73
0
28 Nov 2021
SWAT: Spatial Structure Within and Among Tokens
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Kumara Kahatapitiya
Michael S. Ryoo
264
7
0
26 Nov 2021
Semi-Supervised Music Tagging Transformer
International Society for Music Information Retrieval Conference (ISMIR), 2021
Minz Won
Keunwoo Choi
Xavier Serra
ViT
MedIm
465
51
0
26 Nov 2021
A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021
Himashi Peiris
Munawar Hayat
Zhaolin Chen
Gary Egan
Mehrtash Harandi
ViT
MedIm
207
182
0
26 Nov 2021
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Faisal Ahmed
Zhe Gan
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
337
302
0
25 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
192
83
0
25 Nov 2021
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
David Junhao Zhang
Kunchang Li
Yali Wang
Yuxiang Chen
Shashwat Chandra
Yu Qiao
Luoqi Liu
Mike Zheng Shou
AI4TS
208
35
0
24 Nov 2021
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
Zitong Yu
Yuming Shen
Jingang Shi
Hengshuang Zhao
Juil Sock
Guoying Zhao
ViT
MedIm
344
241
0
23 Nov 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
199
82
0
23 Nov 2021
Ice hockey player identification via transformers and weakly supervised learning
Kanav Vats
William J. McNally
Pascale Walters
David A Clausi
John S. Zelek
ViT
156
27
0
22 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
391
1,049
0
22 Nov 2021
Exploring Segment-level Semantics for Online Phase Recognition from Surgical Videos
IEEE Transactions on Medical Imaging (IEEE TMI), 2021
Xinpeng Ding
Xiaomeng Li
345
46
0
22 Nov 2021
Swin Transformer V2: Scaling Up Capacity and Resolution
Ze Liu
Han Hu
Yutong Lin
Zhuliang Yao
Zhenda Xie
...
Yue Cao
Zheng Zhang
Li Dong
Furu Wei
B. Guo
ViT
553
2,413
0
18 Nov 2021
Evaluating Transformers for Lightweight Action Recognition
Raivo Koot
Markus Hennerbichler
Haiping Lu
ViT
226
8
0
18 Nov 2021
Benchmarking and scaling of deep learning models for land cover image classification
Ioannis Papoutsis
Nikolaos Ioannis Bountos
Angelos Zavras
Dimitrios Michail
Christos Tryfonopoulos
454
72
0
18 Nov 2021
Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction
Computer Vision and Pattern Recognition (CVPR), 2021
Yuanhao Cai
Jing Lin
Xiaowan Hu
Haoqian Wang
X. Yuan
Yulun Zhang
Radu Timofte
Luc Van Gool
167
344
0
15 Nov 2021
Relational Self-Attention: What's Missing in Attention for Video Understanding
Neural Information Processing Systems (NeurIPS), 2021
Manjin Kim
Heeseung Kwon
Chunyu Wang
Suha Kwak
Minsu Cho
ViT
173
36
0
02 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
British Machine Vision Conference (BMVC), 2021
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
297
54
0
01 Nov 2021
Blending Anti-Aliasing into Vision Transformer
Neural Information Processing Systems (NeurIPS), 2021
Shengju Qian
Hao Shao
Yi Zhu
Mu Li
Jiaya Jia
213
23
0
28 Oct 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
299
309
0
25 Oct 2021
The Efficiency Misnomer
International Conference on Learning Representations (ICLR), 2021
Mostafa Dehghani
Anurag Arnab
Lucas Beyer
Ashish Vaswani
Yi Tay
278
112
0
25 Oct 2021
SCENIC: A JAX Library for Computer Vision Research and Beyond
Mostafa Dehghani
A. Gritsenko
Anurag Arnab
Matthias Minderer
Yi Tay
206
75
0
18 Oct 2021
Object-Region Video Transformers
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
382
98
0
13 Oct 2021
StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
European Conference on Computer Vision (ECCV), 2021
Jinghuan Shang
Kumara Kahatapitiya
Xiang Li
Michael S. Ryoo
OffRL
406
41
0
12 Oct 2021
TAda! Temporally-Adaptive Convolutions for Video Understanding
International Conference on Learning Representations (ICLR), 2021
Ziyuan Huang
Shiwei Zhang
Liang Pan
Zhiwu Qing
Mingqian Tang
Ziwei Liu
M. Ang
415
68
0
12 Oct 2021
Multi-Modal Pre-Training for Automated Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
David M. Chan
Shalini Ghosh
D. Chakrabarty
Björn Hoffmeister
SSL
220
16
0
12 Oct 2021
Video Is Graph: Structured Graph Module for Video Action Recognition
Rongjie Li
Xiaojun Wu
Tianyang Xu
368
15
0
12 Oct 2021
EfficientPhys: Enabling Simple, Fast and Accurate Camera-Based Vitals Measurement
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Xin Liu
B. Hill
Ziheng Jiang
Shwetak N. Patel
Daniel J. McDuff
3DH
MedIm
291
135
0
09 Oct 2021
Exploring the Limits of Large Scale Pre-training
Samira Abnar
Mostafa Dehghani
Behnam Neyshabur
Hanie Sedghi
AI4CE
211
133
0
05 Oct 2021
PETA: Photo Albums Event Recognition using Transformers Attention
International Conference on Pattern Recognition (ICPR), 2021
Tamar Glaser
Emanuel Ben-Baruch
Gilad Sharir
Nadav Zamir
Asaf Noy
Lihi Zelnik-Manor
ViT
132
2
0
26 Sep 2021
Long-Range Transformers for Dynamic Spatiotemporal Forecasting
J. E. Grigsby
Zhe Wang
Nam Nguyen
Yanjun Qi
AI4TS
309
117
0
24 Sep 2021
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Yi Tay
Mostafa Dehghani
J. Rao
W. Fedus
Samira Abnar
Hyung Won Chung
Sharan Narang
Dani Yogatama
Ashish Vaswani
Donald Metzler
994
137
0
22 Sep 2021
Audio-Visual Speech Recognition is Worth 32×32×8 Voxels
Dmitriy Serdyuk
Otavio Braga
Olivier Siohan
ViT
183
7
0
20 Sep 2021