ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,302 papers shown
Efficient Video Instance Segmentation via Tracklet Query and Proposal
Computer Vision and Pattern Recognition (CVPR), 2022
Jialian Wu
Sudhir Yarram
Hui Liang
Tian Lan
Junsong Yuan
J. Eledath
Gérard Medioni
175
43
0
03 Mar 2022
Multi-Tailed Vision Transformer for Efficient Inference
Neural Networks (NN), 2022
Yunke Wang
Bo Du
Wenyuan Wang
Chang Xu
ViT
518
10
0
03 Mar 2022
ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection
IEEE International Conference on Image Processing (ICIP), 2022
Zuheng Ming
Zitong Yu
M. Al-Ghadi
M. Visani
M. Luqman
J. Burie
ViT CVBM
129
24
0
03 Mar 2022
TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Kunyu Peng
Alina Roitberg
Kailun Yang
Kailai Li
Rainer Stiefelhagen
ViT
153
39
0
02 Mar 2022
FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours
Shenggan Cheng
Xuanlei Zhao
Guangyang Lu
Bin-Rui Li
Zhongming Yu
Tian Zheng
R. Wu
Xiwen Zhang
Jian Peng
Yang You
AI4CE
189
35
0
02 Mar 2022
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Jing Tan
Yuhong Wang
Gangshan Wu
Limin Wang
187
19
0
01 Mar 2022
On Modality Bias Recognition and Reduction
Yangyang Guo
Liqiang Nie
Harry Cheng
Zhiyong Cheng
Mohan S. Kankanhalli
Marco Bertini
230
48
0
25 Feb 2022
Motion-driven Visual Tempo Learning for Video-based Action Recognition
IEEE Transactions on Image Processing (IEEE TIP), 2022
Yuanzhong Liu
Junsong Yuan
Zhigang Tu
183
77
0
24 Feb 2022
Delving Deep into One-Shot Skeleton-based Action Recognition with Diverse Occlusions
IEEE Transactions on Multimedia (IEEE TMM), 2022
Kunyu Peng
Alina Roitberg
Kailun Yang
Kailai Li
Rainer Stiefelhagen
ViT
309
40
0
23 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT VLM
635
622
0
22 Feb 2022
HiP: Hierarchical Perceiver
João Carreira
Skanda Koppula
Daniel Zoran
Adrià Recasens
Catalin Ionescu
...
M. Botvinick
Oriol Vinyals
Karen Simonyan
Andrew Zisserman
Andrew Jaegle
VLM
315
14
0
22 Feb 2022
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
Computer Vision and Pattern Recognition (CVPR), 2022
Shixing Chen
Chundi Liu
Xiang Hao
Xiaohan Nie
Maxim Arap
Raffay Hamid
180
17
0
22 Feb 2022
ActionFormer: Localizing Moments of Actions with Transformers
European Conference on Computer Vision (ECCV), 2022
Chen-Da Liu-Zhang
Jianxin Wu
Yin Li
ViT
257
435
0
16 Feb 2022
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
International Conference on Machine Learning (ICML), 2022
Nan Wu
Stanislaw Jastrzebski
Dong Wang
Krzysztof J. Geras
125
108
0
10 Feb 2022
OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos
Merey Ramazanova
Victor Escorcia
Fabian Caba Heilbron
Chen Zhao
Guohao Li
179
4
0
10 Feb 2022
Video Violence Recognition and Localization Using a Semi-Supervised Hard Attention Model
Expert Systems with Applications (ESWA), 2022
Hamid Reza Mohammadi
Ehsan Nazerfard
390
33
0
04 Feb 2022
VRT: A Video Restoration Transformer
IEEE Transactions on Image Processing (IEEE TIP), 2022
Christos Sakaridis
Jingyun Liang
Yuchen Fan
Lucas Beerens
Rakesh Ranjan
Yawei Li
Radu Timofte
Luc Van Gool
ViT
318
325
0
28 Jan 2022
Learning To Recognize Procedural Activities with Distant Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
222
94
0
26 Jan 2022
Predicting Knee Osteoarthritis Progression from Structural MRI using Deep Learning
IEEE International Symposium on Biomedical Imaging (ISBI), 2022
E. Panfilov
S. Saarakkala
M. Nieminen
A. Tiulpin
MedIm
254
16
0
26 Jan 2022
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video
Interspeech, 2022
Dmitriy Serdyuk
Otavio Braga
Olivier Siohan
ViT
270
45
0
25 Jan 2022
Transformers in Medical Imaging: A Survey
Fahad Shamshad
Salman Khan
Syed Waqas Zamir
Muhammad Haris Khan
Munawar Hayat
Fahad Shahbaz Khan
Huazhu Fu
ViT LM&MA MedIm
272
917
0
24 Jan 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Kunchang Li
Yali Wang
Junhao Zhang
Shiyang Feng
Guanglu Song
Yu Liu
Jiaming Song
Yu Qiao
ViT
423
509
0
24 Jan 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chao-Yuan Wu
Yanghao Li
K. Mangalam
Haoqi Fan
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
349
242
0
20 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Computer Vision and Pattern Recognition (CVPR), 2022
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
485
283
0
20 Jan 2022
End-to-end Generative Pretraining for Multimodal Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2022
Paul Hongsuck Seo
Arsha Nagrani
Anurag Arnab
Cordelia Schmid
232
184
0
20 Jan 2022
Continual Transformers: Redundancy-Free Attention for Online Inference
International Conference on Learning Representations (ICLR), 2022
Lukas Hedegaard
Arian Bakhtiarnia
Alexandros Iosifidis
CLL
360
14
0
17 Jan 2022
Video Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
374
132
0
16 Jan 2022
Transformers in Action: Weakly Supervised Action Segmentation
John Ridley
Huseyin Coskun
D. Tan
Nassir Navab
F. Tombari
ViT
133
5
0
14 Jan 2022
ViT2Hash: Unsupervised Information-Preserving Hashing
Qinkang Gong
Liangdao Wang
Hanjiang Lai
Yan Pan
Jian Yin
95
5
0
14 Jan 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Kunchang Li
Yali Wang
Shiyang Feng
Guanglu Song
Yu Liu
Jiaming Song
Yu Qiao
ViT
375
318
0
12 Jan 2022
Multiview Transformers for Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Shen Yan
Xuehan Xiong
Anurag Arnab
Zhichao Lu
Mi Zhang
Chen Sun
Cordelia Schmid
ViT
346
263
0
12 Jan 2022
MAXIM: Multi-Axis MLP for Image Processing
Computer Vision and Pattern Recognition (CVPR), 2022
Zhengzhong Tu
Hossein Talebi
Han Zhang
Feng Yang
P. Milanfar
A. Bovik
Yinxiao Li
232
620
0
09 Jan 2022
Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition
Helei Qiu
B. Hou
Bo Ren
Xiaohua Zhang
ViT
182
61
0
08 Jan 2022
Flow-Guided Sparse Transformer for Video Deblurring
International Conference on Machine Learning (ICML), 2022
Jing Lin
Yuanhao Cai
Xiaowan Hu
Haoqian Wang
Youliang Yan
X. Zou
Henghui Ding
Yulun Zhang
Radu Timofte
Luc Van Gool
ViT
171
73
0
06 Jan 2022
Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Haotian Yan
Chuang Zhang
Ming Wu
ViT
304
75
0
05 Jan 2022
RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark
IEEE Journal of Biomedical and Health Informatics (IEEE JBHI), 2022
Zhuo Deng
Yuanhao Cai
Lu Chen
Zheng Gong
Qiqi Bao
Xue Yao
D. Fang
Shaochong Zhang
Lan Ma
ViT MedIm
282
72
0
03 Jan 2022
AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2021
Yulin Wang
Yang Yue
Yuanze Lin
Haojun Jiang
Zihang Lai
V. Kulikov
Nikita Orlov
Humphrey Shi
Gao Huang
181
62
0
28 Dec 2021
MPViT: Multi-Path Vision Transformer for Dense Prediction
Computer Vision and Pattern Recognition (CVPR), 2021
Youngwan Lee
Jonghee Kim
Jeffrey Willette
Sung Ju Hwang
ViT
270
315
0
21 Dec 2021
LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Basura Fernando
Hiroya Takamura
Qi Wu
ViT
169
3
0
19 Dec 2021
A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
European Conference on Computer Vision (ECCV), 2021
Wuyang Chen
Xianzhi Du
Fan Yang
Lucas Beyer
Xiaohua Zhai
...
Huizhong Chen
Jing Li
Xiaodan Song
Zinan Lin
Denny Zhou
ViT
197
29
0
17 Dec 2021
Distillation of Human-Object Interaction Contexts for Action Recognition
Muna Almushyti
Frederick W. Li
250
4
0
17 Dec 2021
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei
Haoqi Fan
Saining Xie
Chaoxia Wu
Alan Yuille
Christoph Feichtenhofer
ViT
433
779
0
16 Dec 2021
SeqFormer: Sequential Transformer for Video Instance Segmentation
Junfeng Wu
Yi Jiang
S. Bai
Wenqing Zhang
Xiang Bai
ViT
190
132
0
15 Dec 2021
Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos
Pengfei Pei
Xianfeng Zhao
Yun Cao
Jinchuan Li
Xiaowei Yi
ViT
220
9
0
15 Dec 2021
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
140
62
0
14 Dec 2021
Translating Human Mobility Forecasting through Natural Language Generation
Hao Xue
Flora D. Salim
Yongli Ren
C. Clarke
AI4TS
119
25
0
13 Dec 2021
Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer
Yunhai Han
Kelin Yu
Rahul Batra
Nathan Boyd
Chaitanya Mehta
T. Zhao
Y. She
S. Hutchinson
Ye Zhao
ViT
401
67
0
13 Dec 2021
DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang
Pan Zhou
Roger Zimmermann
Shuicheng Yan
ViT
168
24
0
09 Dec 2021
MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
Rex Liu
Huan Zhang
Hamed Pirsiavash
Xin Liu
ViT
232
16
0
08 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP VLM
320
850
0
08 Dec 2021