2103.15691
ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
HuggingFace (3 upvotes)
Github (3544★)
Papers citing "ViViT: A Video Vision Transformer"
50 of 1,302 papers shown
Efficient Video Instance Segmentation via Tracklet Query and Proposal
Computer Vision and Pattern Recognition (CVPR), 2022
Jialian Wu
Sudhir Yarram
Hui Liang
Tian Lan
Junsong Yuan
J. Eledath
Gérard Medioni
175
43
0
03 Mar 2022
Multi-Tailed Vision Transformer for Efficient Inference
Neural Networks (NN), 2022
Yunke Wang
Bo Du
Wenyuan Wang
Chang Xu
ViT
518
10
0
03 Mar 2022
ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection
IEEE International Conference on Image Processing (ICIP), 2022
Zuheng Ming
Zitong Yu
M. Al-Ghadi
M. Visani
M. Luqman
J. Burie
ViT
CVBM
129
24
0
03 Mar 2022
TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Kunyu Peng
Alina Roitberg
Kailun Yang
Kailai Li
Rainer Stiefelhagen
ViT
153
39
0
02 Mar 2022
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
Shenggan Cheng
Xuanlei Zhao
Guangyang Lu
Bin-Rui Li
Zhongming Yu
Tian Zheng
R. Wu
Xiwen Zhang
Jian Peng
Yang You
AI4CE
189
35
0
02 Mar 2022
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Jing Tan
Yuhong Wang
Gangshan Wu
Limin Wang
187
19
0
01 Mar 2022
On Modality Bias Recognition and Reduction
Yangyang Guo
Liqiang Nie
Harry Cheng
Zhiyong Cheng
Mohan S. Kankanhalli
Marco Bertini
230
48
0
25 Feb 2022
Motion-driven Visual Tempo Learning for Video-based Action Recognition
IEEE Transactions on Image Processing (IEEE TIP), 2022
Yuanzhong Liu
Junsong Yuan
Zhigang Tu
183
77
0
24 Feb 2022
Delving Deep into One-Shot Skeleton-based Action Recognition with Diverse Occlusions
IEEE Transactions on Multimedia (IEEE TMM), 2022
Kunyu Peng
Alina Roitberg
Kailun Yang
Kailai Li
Rainer Stiefelhagen
ViT
309
40
0
23 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT
VLM
635
622
0
22 Feb 2022
HiP: Hierarchical Perceiver
João Carreira
Skanda Koppula
Daniel Zoran
Adrià Recasens
Catalin Ionescu
...
M. Botvinick
Oriol Vinyals
Karen Simonyan
Andrew Zisserman
Andrew Jaegle
VLM
315
14
0
22 Feb 2022
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
Computer Vision and Pattern Recognition (CVPR), 2022
Shixing Chen
Chundi Liu
Xiang Hao
Xiaohan Nie
Maxim Arap
Raffay Hamid
180
17
0
22 Feb 2022
ActionFormer: Localizing Moments of Actions with Transformers
European Conference on Computer Vision (ECCV), 2022
Chen-Lin Zhang
Jianxin Wu
Yin Li
ViT
257
435
0
16 Feb 2022
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
International Conference on Machine Learning (ICML), 2022
Nan Wu
Stanislaw Jastrzebski
Dong Wang
Krzysztof J. Geras
125
108
0
10 Feb 2022
OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos
Merey Ramazanova
Victor Escorcia
Fabian Caba Heilbron
Chen Zhao
Guohao Li
179
4
0
10 Feb 2022
Video Violence Recognition and Localization Using a Semi-Supervised Hard Attention Model
Expert Systems with Applications (ESWA), 2022
Hamid Reza Mohammadi
Ehsan Nazerfard
390
33
0
04 Feb 2022
VRT: A Video Restoration Transformer
IEEE Transactions on Image Processing (IEEE TIP), 2022
Christos Sakaridis
Jingyun Liang
Yuchen Fan
Lucas Beerens
Rakesh Ranjan
Yawei Li
Radu Timofte
Luc Van Gool
ViT
318
325
0
28 Jan 2022
Learning To Recognize Procedural Activities with Distant Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
222
94
0
26 Jan 2022
Predicting Knee Osteoarthritis Progression from Structural MRI using Deep Learning
IEEE International Symposium on Biomedical Imaging (ISBI), 2022
E. Panfilov
S. Saarakkala
M. Nieminen
A. Tiulpin
MedIm
254
16
0
26 Jan 2022
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video
Interspeech, 2022
Dmitriy Serdyuk
Otavio Braga
Olivier Siohan
ViT
270
45
0
25 Jan 2022
Transformers in Medical Imaging: A Survey
Fahad Shamshad
Salman Khan
Syed Waqas Zamir
Muhammad Haris Khan
Munawar Hayat
Fahad Shahbaz Khan
Huazhu Fu
ViT
LM&MA
MedIm
272
917
0
24 Jan 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Kunchang Li
Yali Wang
Junhao Zhang
Shiyang Feng
Guanglu Song
Yu Liu
Jiaming Song
Yu Qiao
ViT
423
509
0
24 Jan 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chao-Yuan Wu
Yanghao Li
K. Mangalam
Haoqi Fan
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
349
242
0
20 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Computer Vision and Pattern Recognition (CVPR), 2022
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
485
283
0
20 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
232
184
0
20 Jan 2022
Continual Transformers: Redundancy-Free Attention for Online Inference
International Conference on Learning Representations (ICLR), 2022
Lukas Hedegaard
Arian Bakhtiarnia
Alexandros Iosifidis
CLL
360
14
0
17 Jan 2022
Video Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
374
132
0
16 Jan 2022
Transformers in Action: Weakly Supervised Action Segmentation
John Ridley
Huseyin Coskun
D. Tan
Nassir Navab
F. Tombari
ViT
133
5
0
14 Jan 2022
ViT2Hash: Unsupervised Information-Preserving Hashing
Qinkang Gong
Liangdao Wang
Hanjiang Lai
Yan Pan
Jian Yin
95
5
0
14 Jan 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Kunchang Li
Yali Wang
Shiyang Feng
Guanglu Song
Yu Liu
Jiaming Song
Yu Qiao
ViT
375
318
0
12 Jan 2022
Multiview Transformers for Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Shen Yan
Xuehan Xiong
Anurag Arnab
Zhichao Lu
Mi Zhang
Chen Sun
Cordelia Schmid
ViT
346
263
0
12 Jan 2022
MAXIM: Multi-Axis MLP for Image Processing
Computer Vision and Pattern Recognition (CVPR), 2022
Zhengzhong Tu
Hossein Talebi
Han Zhang
Feng Yang
P. Milanfar
A. Bovik
Yinxiao Li
232
620
0
09 Jan 2022
Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition
Helei Qiu
B. Hou
Bo Ren
Xiaohua Zhang
ViT
182
61
0
08 Jan 2022
Flow-Guided Sparse Transformer for Video Deblurring
International Conference on Machine Learning (ICML), 2022
Jing Lin
Yuanhao Cai
Xiaowan Hu
Haoqian Wang
Youliang Yan
X. Zou
Henghui Ding
Yulun Zhang
Radu Timofte
Luc Van Gool
ViT
171
73
0
06 Jan 2022
Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Haotian Yan
Chuang Zhang
Ming Wu
ViT
304
75
0
05 Jan 2022
RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark
IEEE Journal of Biomedical and Health Informatics (IEEE JBHI), 2022
Zhuo Deng
Yuanhao Cai
Lu Chen
Zheng Gong
Qiqi Bao
Xue Yao
D. Fang
Shaochong Zhang
Lan Ma
ViT
MedIm
282
72
0
03 Jan 2022
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Yulin Wang
Yang Yue
Yuanze Lin
Haojun Jiang
Zihang Lai
V. Kulikov
Nikita Orlov
Humphrey Shi
Gao Huang
181
62
0
28 Dec 2021
MPViT: Multi-Path Vision Transformer for Dense Prediction
Computer Vision and Pattern Recognition (CVPR), 2022
Youngwan Lee
Jonghee Kim
Jeffrey Willette
Sung Ju Hwang
ViT
270
315
0
21 Dec 2021
LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Basura Fernando
Hiroya Takamura
Qi Wu
ViT
169
3
0
19 Dec 2021
A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
European Conference on Computer Vision (ECCV), 2022
Wuyang Chen
Xianzhi Du
Fan Yang
Lucas Beyer
Xiaohua Zhai
...
Huizhong Chen
Jing Li
Xiaodan Song
Zinan Lin
Denny Zhou
ViT
197
29
0
17 Dec 2021
Distillation of Human-Object Interaction Contexts for Action Recognition
Muna Almushyti
Frederick W. Li
250
4
0
17 Dec 2021
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei
Haoqi Fan
Saining Xie
Chaoxia Wu
Alan Yuille
Christoph Feichtenhofer
ViT
433
779
0
16 Dec 2021
SeqFormer: Sequential Transformer for Video Instance Segmentation
Junfeng Wu
Yi Jiang
S. Bai
Wenqing Zhang
Xiang Bai
ViT
190
132
0
15 Dec 2021
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos
Pengfei Pei
Xianfeng Zhao
Yun Cao
Jinchuan Li
Xiaowei Yi
ViT
220
9
0
15 Dec 2021
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
140
62
0
14 Dec 2021
Translating Human Mobility Forecasting through Natural Language Generation
Hao Xue
Flora D. Salim
Yongli Ren
C. Clarke
AI4TS
119
25
0
13 Dec 2021
Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer
Yunhai Han
Kelin Yu
Rahul Batra
Nathan Boyd
Chaitanya Mehta
T. Zhao
Y. She
S. Hutchinson
Ye Zhao
ViT
401
67
0
13 Dec 2021
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang
Pan Zhou
Roger Zimmermann
Shuicheng Yan
ViT
168
24
0
09 Dec 2021
MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
Rex Liu
Huan Zhang
Hamed Pirsiavash
Xin Liu
ViT
232
16
0
08 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
320
850
0
08 Dec 2021