ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT
ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (3544★)

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,309 papers shown
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Xingyi Cheng
Hezheng Lin
Xiangyu Wu
Fan Yang
Dong Shen
282
169
0
09 Sep 2021
Revisiting 3D ResNets for Video Recognition
Xianzhi Du
Yeqing Li
Huayu Chen
Rui Qian
Jing Li
Irwan Bello
239
20
0
03 Sep 2021
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Neural Information Processing Systems (NeurIPS), 2021
Xuefan Zha
Wentao Zhu
Tingxun Lv
Sen Yang
Ji Liu
AI4TS
ViT
289
30
0
26 Aug 2021
StarVQA: Space-Time Attention for Video Quality Assessment
IEEE International Conference on Image Processing (ICIP), 2021
Fengchuang Xing
Yuan-Gen Wang
Hanpin Wang
Leida Li
Guopu Zhu
ViT
80
27
0
22 Aug 2021
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Jiawei Chen
C. Ho
ViT
258
100
0
20 Aug 2021
RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
Asian Conference on Computer Vision (ACCV), 2021
Yuki Tatsunami
Masato Taki
186
12
0
09 Aug 2021
EAN: Event Adaptive Network for Enhanced Action Recognition
International Journal of Computer Vision (IJCV), 2021
Yuan Tian
Manwen Liao
Guangtao Zhai
G. Guo
Zhiyong Gao
171
52
0
22 Jul 2021
CycleMLP: A MLP-like Architecture for Dense Prediction
International Conference on Learning Representations (ICLR), 2021
Shoufa Chen
Enze Xie
Chongjian Ge
Runjian Chen
Ding Liang
Ping Luo
354
251
0
21 Jul 2021
Is attention to bounding boxes all you need for pedestrian action prediction?
Lina Achaji
Julien Moreau
Thibault Fouqueray
François Aioun
François Charpillet
229
41
0
16 Jul 2021
ViTGAN: Training GANs with Vision Transformers
International Conference on Learning Representations (ICLR), 2021
Kwonjoon Lee
Huiwen Chang
Lu Jiang
Han Zhang
Zhuowen Tu
Ce Liu
ViT
321
220
0
09 Jul 2021
Long Short-Term Transformer for Online Action Detection
Neural Information Processing Systems (NeurIPS), 2021
Mingze Xu
Yuanjun Xiong
Hao Chen
Xinyu Li
Wei Xia
Zhuowen Tu
Stefano Soatto
ViT
288
170
0
07 Jul 2021
VideoLightFormer: Lightweight Action Recognition using Transformers
Raivo Koot
Haiping Lu
ViT
232
9
0
01 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Neural Information Processing Systems (NeurIPS), 2021
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
576
690
0
30 Jun 2021
Can An Image Classifier Suffice For Action Recognition?
International Conference on Learning Representations (ICLR), 2021
Quanfu Fan
Chun-Fu Chen
Chen
Yikang Shen
ViT
280
36
0
26 Jun 2021
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
429
1,859
0
24 Jun 2021
Exploring Stronger Feature for Temporal Action Localization
Zhiwu Qing
Xiang Wang
Ziyuan Huang
Yutong Feng
Shiwei Zhang
Jianwen Jiang
Mingqian Tang
Changxin Gao
Nong Sang
128
4
0
24 Jun 2021
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo
A. Piergiovanni
Anurag Arnab
Mostafa Dehghani
A. Angelova
ViT
589
154
0
21 Jun 2021
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
Hao Tan
Jie Lei
Thomas Wolf
Joey Tianyi Zhou
203
73
0
21 Jun 2021
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
Han Fang
Pengfei Xiong
Luhui Xu
Yu Chen
CLIP
VLM
315
343
0
21 Jun 2021
Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling
Xiang Wang
Zhiwu Qing
Ziyuan Huang
Yutong Feng
Shiwei Zhang
Jianwen Jiang
Mingqian Tang
Yuanjie Shao
Nong Sang
234
5
0
20 Jun 2021
Proposal Relation Network for Temporal Action Detection
Xiang Wang
Zhiwu Qing
Ziyuan Huang
Yutong Feng
Shiwei Zhang
Jianwen Jiang
Mingqian Tang
Changxin Gao
Nong Sang
ViT
117
27
0
20 Jun 2021
XCiT: Cross-Covariance Image Transformers
Neural Information Processing Systems (NeurIPS), 2021
Alaaeldin El-Nouby
Hugo Touvron
Mathilde Caron
Piotr Bojanowski
Matthijs Douze
...
Ivan Laptev
Natalia Neverova
Gabriel Synnaeve
Jakob Verbeek
Edouard Grave
ViT
395
610
0
17 Jun 2021
Long-Short Temporal Contrastive Learning of Video Transformers
Jue Wang
Gedas Bertasius
Du Tran
Lorenzo Torresani
VLM
ViT
283
56
0
17 Jun 2021
Relation Modeling in Spatio-Temporal Action Localization
Yutong Feng
Jianwen Jiang
Ziyuan Huang
Zhiwu Qing
Xiang Wang
Shiwei Zhang
Mingqian Tang
Yue Gao
178
11
0
15 Jun 2021
A Stronger Baseline for Ego-Centric Action Detection
Zhiwu Qing
Ziyuan Huang
Xiang Wang
Yutong Feng
Shiwei Zhang
Jianwen Jiang
Mingqian Tang
Changxin Gao
M. Ang
Nong Sang
EgoV
145
3
0
13 Jun 2021
Space-time Mixing Attention for Video Transformer
Neural Information Processing Systems (NeurIPS), 2021
Adrian Bulat
Juan-Manuel Perez-Rua
Swathikiran Sudhakaran
Brais Martínez
Georgios Tzimiropoulos
ViT
287
141
0
10 Jun 2021
Scaling Vision with Sparse Mixture of Experts
Neural Information Processing Systems (NeurIPS), 2021
C. Riquelme
J. Puigcerver
Basil Mustafa
Maxim Neumann
Rodolphe Jenatton
André Susano Pinto
Daniel Keysers
N. Houlsby
MoE
309
834
0
10 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Neural Information Processing Systems (NeurIPS), 2021
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
279
339
0
09 Jun 2021
Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition
Ziyuan Huang
Zhiwu Qing
Xiang Wang
Yutong Feng
Shiwei Zhang
Jianwen Jiang
Zhurong Xia
Mingqian Tang
Nong Sang
M. Ang
ViT
126
13
0
09 Jun 2021
A Survey of Transformers
AI Open (AO), 2021
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
441
1,380
0
08 Jun 2021
SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition
Neural Information Processing Systems (NeurIPS), 2021
Rishabh Kabra
Daniel Zoran
Goker Erdogan
Loic Matthey
Antonia Creswell
M. Botvinick
Alexander Lerchner
Christopher P. Burgess
OCL
296
85
0
07 Jun 2021
On the Expressive Power of Self-Attention Matrices
Valerii Likhosherstov
K. Choromanski
Adrian Weller
344
43
0
07 Jun 2021
Video Instance Segmentation using Inter-Frame Communication Transformers
Neural Information Processing Systems (NeurIPS), 2021
Sukjun Hwang
Miran Heo
Seoung Wug Oh
Seon Joo Kim
ViT
245
158
0
07 Jun 2021
Transformed ROIs for Capturing Visual Transformations in Videos
Computer Vision and Image Understanding (CVIU), 2021
Abhinav Rai
Fadime Sener
Angela Yao
ViT
221
4
0
06 Jun 2021
CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
Neural Information Processing Systems (NeurIPS), 2021
Tatiana Likhomanenko
Qiantong Xu
Gabriel Synnaeve
R. Collobert
A. Rogozhnikov
OOD
ViT
326
70
0
06 Jun 2021
Signal Transformer: Complex-valued Attention and Meta-Learning for Signal Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
Yihong Dong
Ying Peng
Muqiao Yang
Songtao Lu
Qingjiang Shi
399
12
0
05 Jun 2021
Anticipative Video Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Rohit Girdhar
Kristen Grauman
ViT
328
249
0
03 Jun 2021
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
International Conference on Learning Representations (ICLR), 2021
Xiangning Chen
Cho-Jui Hsieh
Boqing Gong
ViT
367
373
0
03 Jun 2021
Continual 3D Convolutional Neural Networks for Real-time Processing of Videos
European Conference on Computer Vision (ECCV), 2021
Lukas Hedegaard
Alexandros Iosifidis
3DPC
322
19
0
31 May 2021
Gaze Estimation using Transformer
International Conference on Pattern Recognition (ICPR), 2021
Yihua Cheng
Feng Lu
ViT
213
131
0
30 May 2021
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization
IEEE Transactions on Image Processing (TIP), 2021
Lu Dong
Limin Wang
Yali Wang
Xiao Ma
Yu Qiao
272
77
0
24 May 2021
Segmenter: Transformer for Semantic Segmentation
IEEE International Conference on Computer Vision (ICCV), 2021
Robin Strudel
Ricardo Garcia Pinel
Ivan Laptev
Cordelia Schmid
ViT
721
1,771
0
12 May 2021
A Fast Partial Video Copy Detection Using KNN and Global Feature Database
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Weijun Tan
Hongwei Guo
Rushuai Liu
250
13
0
04 May 2021
Vision Transformers with Patch Diversification
Chengyue Gong
Dilin Wang
Meng Li
Vikas Chandra
Qiang Liu
ViT
253
68
0
26 Apr 2021
VidTr: Video Transformer Without Convolutions
IEEE International Conference on Computer Vision (ICCV), 2021
Yanyi Zhang
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Biagio Brattoli
Hao Chen
I. Marsic
Joseph Tighe
ViT
415
215
0
23 Apr 2021
Multiscale Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
481
1,503
0
22 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Neural Information Processing Systems (NeurIPS), 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
720
677
0
22 Apr 2021
Writing in The Air: Unconstrained Text Recognition from Finger Movement Using Spatio-Temporal Convolution
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2021
Ue-Hwan Kim
Yewon Hwang
Sun-Kyung Lee
Jong-Hwan Kim
150
23
0
19 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
1.4K
1,001
0
18 Apr 2021
Higher Order Recurrent Space-Time Transformer for Video Action Prediction
Tsung-Ming Tai
G. Fiameni
Cheng-Kuang Lee
Oswald Lanz
180
11
0
17 Apr 2021