ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,308 papers shown
Title
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
148
62
0
14 Dec 2021
Translating Human Mobility Forecasting through Natural Language Generation
Hao Xue
Flora D. Salim
Yongli Ren
C. Clarke
AI4TS
127
25
0
13 Dec 2021
Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer
Yunhai Han
Kelin Yu
Rahul Batra
Nathan Boyd
Chaitanya Mehta
T. Zhao
Y. She
S. Hutchinson
Ye Zhao
ViT
449
68
0
13 Dec 2021
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang
Pan Zhou
Roger Zimmermann
Shuicheng Yan
ViT
168
24
0
09 Dec 2021
MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
Rex Liu
Huan Zhang
Hamed Pirsiavash
Xin Liu
ViT
256
16
0
08 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP
VLM
340
855
0
08 Dec 2021
Prompting Visual-Language Models for Efficient Video Understanding
Chen Ju
Tengda Han
Kunhao Zheng
Ya Zhang
Weidi Xie
VPVLM
VLM
352
457
0
08 Dec 2021
MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
Rui Dai
Srijan Das
Kumara Kahatapitiya
Michael S. Ryoo
Francois Bremond
ViT
265
95
0
07 Dec 2021
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Computer Vision and Pattern Recognition (CVPR), 2021
Zhao Yang
Yuan Liu
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Juil Sock
809
416
0
04 Dec 2021
BEVT: BERT Pretraining of Video Transformers
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
244
245
0
02 Dec 2021
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li
Chaoxia Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
447
834
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
232
151
0
02 Dec 2021
Self-supervised Video Transformer
Kanchana Ranasinghe
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
Michael S. Ryoo
ViT
304
108
0
02 Dec 2021
Video-Text Pre-training with Learned Regions
Rui Yan
Mike Zheng Shou
Yixiao Ge
Alex Jinpeng Wang
Xudong Lin
Guanyu Cai
Jinhui Tang
240
27
0
02 Dec 2021
Multi-domain Integrative Swin Transformer network for Sparse-View Tomographic Reconstruction
Jiayi Pan
Heye Zhang
Weifei Wu
Zijian Gao
Weiwen Wu
326
72
0
28 Nov 2021
SWAT: Spatial Structure Within and Among Tokens
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Kumara Kahatapitiya
Michael S. Ryoo
236
7
0
26 Nov 2021
Semi-Supervised Music Tagging Transformer
International Society for Music Information Retrieval Conference (ISMIR), 2021
Minz Won
Keunwoo Choi
Xavier Serra
ViT
MedIm
461
51
0
26 Nov 2021
A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2021
Himashi Peiris
Munawar Hayat
Zhaolin Chen
Gary Egan
Mehrtash Harandi
ViT
MedIm
194
178
0
26 Nov 2021
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Faisal Ahmed
Zhe Gan
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
302
296
0
25 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
184
82
0
25 Nov 2021
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
David Junhao Zhang
Kunchang Li
Yali Wang
Yuxiang Chen
Shashwat Chandra
Yu Qiao
Luoqi Liu
Mike Zheng Shou
AI4TS
179
35
0
24 Nov 2021
PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer
Zitong Yu
Yuming Shen
Jingang Shi
Hengshuang Zhao
Juil Sock
Guoying Zhao
ViT
MedIm
330
237
0
23 Nov 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
158
82
0
23 Nov 2021
Ice hockey player identification via transformers and weakly supervised learning
Kanav Vats
William J. McNally
Pascale Walters
David A Clausi
John S. Zelek
ViT
151
27
0
22 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
377
1,043
0
22 Nov 2021
Exploring Segment-level Semantics for Online Phase Recognition from Surgical Videos
IEEE Transactions on Medical Imaging (IEEE TMI), 2021
Xinpeng Ding
Xiaomeng Li
302
45
0
22 Nov 2021
Swin Transformer V2: Scaling Up Capacity and Resolution
Ze Liu
Han Hu
Yutong Lin
Zhuliang Yao
Zhenda Xie
...
Yue Cao
Zheng Zhang
Li Dong
Furu Wei
B. Guo
ViT
493
2,375
0
18 Nov 2021
Evaluating Transformers for Lightweight Action Recognition
Raivo Koot
Markus Hennerbichler
Haiping Lu
ViT
204
8
0
18 Nov 2021
Benchmarking and scaling of deep learning models for land cover image classification
Ioannis Papoutsis
Nikolaos Ioannis Bountos
Angelos Zavras
Dimitrios Michail
Christos Tryfonopoulos
421
71
0
18 Nov 2021
Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction
Computer Vision and Pattern Recognition (CVPR), 2021
Yuanhao Cai
Jing Lin
Xiaowan Hu
Haoqian Wang
X. Yuan
Yulun Zhang
Radu Timofte
Luc Van Gool
138
336
0
15 Nov 2021
Relational Self-Attention: What's Missing in Attention for Video Understanding
Neural Information Processing Systems (NeurIPS), 2021
Manjin Kim
Heeseung Kwon
Chunyu Wang
Suha Kwak
Minsu Cho
ViT
158
36
0
02 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
British Machine Vision Conference (BMVC), 2021
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
264
54
0
01 Nov 2021
Blending Anti-Aliasing into Vision Transformer
Neural Information Processing Systems (NeurIPS), 2021
Shengju Qian
Hao Shao
Yi Zhu
Mu Li
Jiaya Jia
183
23
0
28 Oct 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
Shizhe Chen
Pierre-Louis Guhur
Cordelia Schmid
Ivan Laptev
LM&Ro
283
303
0
25 Oct 2021
The Efficiency Misnomer
International Conference on Learning Representations (ICLR), 2021
Daoyuan Chen
Liuyi Yao
Dawei Gao
Ashish Vaswani
Yaliang Li
267
112
0
25 Oct 2021
SCENIC: A JAX Library for Computer Vision Research and Beyond
Mostafa Dehghani
A. Gritsenko
Anurag Arnab
Matthias Minderer
Yi Tay
190
75
0
18 Oct 2021
Object-Region Video Transformers
Roei Herzig
Elad Ben-Avraham
K. Mangalam
Amir Bar
Gal Chechik
Anna Rohrbach
Trevor Darrell
Amir Globerson
ViT
353
94
0
13 Oct 2021
StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning
European Conference on Computer Vision (ECCV), 2021
Jinghuan Shang
Kumara Kahatapitiya
Xiang Li
Michael S. Ryoo
OffRL
367
40
0
12 Oct 2021
TAda! Temporally-Adaptive Convolutions for Video Understanding
International Conference on Learning Representations (ICLR), 2021
Ziyuan Huang
Shiwei Zhang
Liang Pan
Zhiwu Qing
Mingqian Tang
Ziwei Liu
M. Ang
394
67
0
12 Oct 2021
Multi-Modal Pre-Training for Automated Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
David M. Chan
Shalini Ghosh
D. Chakrabarty
Björn Hoffmeister
SSL
208
16
0
12 Oct 2021
Video Is Graph: Structured Graph Module for Video Action Recognition
Rongjie Li
Xiaojun Wu
Tianyang Xu
326
14
0
12 Oct 2021
EfficientPhys: Enabling Simple, Fast and Accurate Camera-Based Vitals Measurement
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Xin Liu
B. Hill
Ziheng Jiang
Shwetak N. Patel
Daniel J. McDuff
3DH
MedIm
271
132
0
09 Oct 2021
Exploring the Limits of Large Scale Pre-training
Samira Abnar
Mostafa Dehghani
Behnam Neyshabur
Hanie Sedghi
AI4CE
201
133
0
05 Oct 2021
PETA: Photo Albums Event Recognition using Transformers Attention
International Conference on Pattern Recognition (ICPR), 2021
Tamar Glaser
Emanuel Ben-Baruch
Gilad Sharir
Nadav Zamir
Asaf Noy
Lihi Zelnik-Manor
ViT
113
2
0
26 Sep 2021
Long-Range Transformers for Dynamic Spatiotemporal Forecasting
J. E. Grigsby
Zhe Wang
Nam Nguyen
Yanjun Qi
AI4TS
305
115
0
24 Sep 2021
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Yi Tay
Mostafa Dehghani
J. Rao
W. Fedus
Samira Abnar
Hyung Won Chung
Sharan Narang
Dani Yogatama
Ashish Vaswani
Donald Metzler
825
135
0
22 Sep 2021
Audio-Visual Speech Recognition is Worth 32×32×8 Voxels
Dmitriy Serdyuk
Otavio Braga
Olivier Siohan
ViT
163
7
0
20 Sep 2021
ActionCLIP: A New Paradigm for Video Action Recognition
Mengmeng Wang
Jiazheng Xing
Yong Liu
VLM
360
459
0
17 Sep 2021
Is Attention Better Than Matrix Decomposition?
International Conference on Learning Representations (ICLR), 2021
Zhengyang Geng
Meng-Hao Guo
Hongxu Chen
Xia Li
Ke Wei
Zhouchen Lin
251
166
0
09 Sep 2021
Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Xingyi Cheng
Hezheng Lin
Xiangyu Wu
Fan Yang
Dong Shen
241
169
0
09 Sep 2021