ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT
ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (3544★)

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,299 papers shown
Seurat: From Moving Points to Depth
Computer Vision and Pattern Recognition (CVPR), 2025
Seokju Cho
Jiahui Huang
S. Kim
Joon-Young Lee
3DPC MDE
238
8
0
20 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
341
19
0
20 Apr 2025
Advancing Video Anomaly Detection: A Bi-Directional Hybrid Framework for Enhanced Single- and Multi-Task Approaches
IEEE Transactions on Image Processing (TIP), 2024
Guodong Shen
Yuqi Ouyang
Junru Lu
Yixuan Yang
Victor Sanchez
406
4
0
20 Apr 2025
PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition
Jongseo Lee
Wooil Lee
Gyeong-Moon Park
Seong Tae Kim
Jinwoo Choi
337
1
0
17 Apr 2025
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Alexander Brettmann
Jakob Grävinghoff
Marlene Rüschoff
Marie Westhues
SLR
213
1
0
10 Apr 2025
Extending Visual Dynamics for Video-to-Music Generation
Xiaohao Liu
Teng Tu
Yunshan Ma
Tat-Seng Chua
VGen
209
1
0
10 Apr 2025
Deep Learning for Cardiovascular Risk Assessment: Proxy Features from Carotid Sonography as Predictors of Arterial Damage
Annual Conference on Medical Image Understanding and Analysis (MIUA), 2025
Christoph Balada
Aida Romano-Martinez
Vincent ten Cate
Katharina Geschke
Jonas Tesarz
...
Dativa Tibyampansha
Karl-Patrik Kresoja
Philipp S. Wild
Sheraz Ahmed
Andreas Dengel
129
0
0
09 Apr 2025
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Piyush Bagad
Hazel Doughty
Bernard Ghanem
Cees G. M. Snoek
ViT SSL
271
0
0
08 Apr 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
Akash Kumar
Ashlesha Kumar
Vibhav Vineet
Yogesh S Rawat
SSL
843
3
0
08 Apr 2025
Balancing long- and short-term dynamics for the modeling of saliency in videos
Theodor Wulff
Fares Abawi
Philipp Allgeuer
Stefan Wermter
140
0
0
08 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
240
1
0
03 Apr 2025
MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
ViT
246
2
0
03 Apr 2025
Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness
Yusheng Zhao
Junyu Luo
Zhiyuan Ning
Weizhi Zhang
Zhiping Xiao
Wei Ju
Philip S. Yu
Ming Zhang
AuLLM
249
0
0
03 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2025
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
371
1
0
03 Apr 2025
Is Temporal Prompting All We Need For Limited Labeled Action Recognition?
Shreyank N. Gowda
Boyan Gao
Xiao Gu
Xiaobo Jin
VLM
275
0
0
02 Apr 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Eshika Khandelwal
Gül Varol
Weidi Xie
Andrew Zisserman
DiffM VGen
348
3
0
01 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Jianchao Tan
MGen VGen
467
2
0
01 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
222
1
0
31 Mar 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
424
0
0
30 Mar 2025
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
Conference on Algebraic Informatics (AI), 2025
Jonathan Attard
Dylan Seychell
188
0
0
27 Mar 2025
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVM VGen
302
3
0
27 Mar 2025
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Jiaheng Zhou
Yanfeng Zhou
Wei Fang
Yuxing Tang
Le Lu
Ge Yang
Mamba
932
0
0
26 Mar 2025
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
Computer Vision and Pattern Recognition (CVPR), 2025
Zihang Lai
Andrea Vedaldi
173
3
0
25 Mar 2025
ATARS: An Aerial Traffic Atomic Activity Recognition and Temporal Segmentation Dataset
Zihao Chen
Hsuanyu Wu
Chi-Hsi Kung
Yi-Ting Chen
Yan-Tsung Peng
209
1
0
24 Mar 2025
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Wencheng Zhu
Yuexin Wang
Hongxuan Li
Pengfei Zhu
Q. Hu
CLIP
298
0
0
24 Mar 2025
Context-Enhanced Memory-Refined Transformer for Online Action Detection
Computer Vision and Pattern Recognition (CVPR), 2025
Zhanzhong Pang
Fadime Sener
Angela Yao
OffRL
248
4
0
24 Mar 2025
TruthLens: Visual Grounding for Universal DeepFake Reasoning
Rohit Kundu
Shan Jia
Vishal Mohanty
Athula Balachandran
Amit K. Roy-Chowdhury
344
3
0
20 Mar 2025
Action tube generation by person query matching for spatio-temporal action detection
Kazuki Omi
Jion Oshima
Toru Tamaki
307
0
0
17 Mar 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
228
4
0
17 Mar 2025
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory
Saket Gurukar
Asim Kadav
VLM
305
1
0
17 Mar 2025
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition
Shristi Das Biswas
Efstathia Soufleri
Arani Roy
Kaushik Roy
246
1
0
17 Mar 2025
VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
Yunze Liu
Peiran Wu
C. Liang
Junxiao Shen
Limin Wang
Li Yi
Mamba
304
2
0
16 Mar 2025
Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder
Enes Erdogan
E. Aksoy
Sanem Sariel
238
0
0
15 Mar 2025
TransiT: Transient Transformer for Non-line-of-sight Videography
Ruiqian Li
Siyuan Shen
Suan Xia
Zehao Wang
Xingyue Peng
Chengxuan Song
Yingsheng Zhu
Tao Wu
Shiying Li
Jingyi Yu
182
0
0
14 Mar 2025
VGGT: Visual Geometry Grounded Transformer
Computer Vision and Pattern Recognition (CVPR), 2025
Jianyuan Wang
Minghao Chen
Nikita Karaev
Andrea Vedaldi
Christian Rupprecht
David Novotny
ViT
379
408
0
14 Mar 2025
NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Mert Albaba
Chenhao Li
Markos Diomataris
Omid Taheri
Andreas Krause
M. Black
VGen
214
6
0
13 Mar 2025
Robustness Tokens: Towards Adversarial Robustness of Transformers
European Conference on Computer Vision (ECCV), 2025
Brian Pulfer
Yury Belousov
S. Voloshynovskiy
AAML
198
0
0
13 Mar 2025
Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
Haoxuan Li
Sixu Yan
Yongqian Li
Xinggang Wang
LM&Ro
284
2
0
13 Mar 2025
A Survey on Knowledge-Oriented Retrieval-Augmented Generation
Mingyue Cheng
Yucong Luo
Jie Ouyang
Qiang Liu
Huijie Liu
...
Bohou Zhang
Jiawei Cao
Jie Ma
Daoyu Wang
Tong Xu
3DV
334
31
0
11 Mar 2025
An Optimization Algorithm for Multimodal Data Alignment
Wei Zhang
Xinyu Wang
Lan Yu
S. Li
122
0
0
05 Mar 2025
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
Computer Vision and Pattern Recognition (CVPR), 2025
Tianqing Zhang
Kairong Yu
Xian Zhong
Hongwei Wang
Qi Xu
Qiang Zhang
284
6
0
04 Mar 2025
Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
Seokun Kang
Taehwan Kim
212
0
0
04 Mar 2025
Anatomically-guided masked autoencoder pre-training for aneurysm detection
Alberto Mario Ceballos-Arroyo
Jisoo Kim
Hongpeng Zhou
Lei Qin
Geoffrey S. Young
Huaizu Jiang
ViT MedIm
137
0
0
28 Feb 2025
Revisiting Kernel Attention with Correlated Gaussian Process Representation
Conference on Uncertainty in Artificial Intelligence (UAI), 2025
Long Minh Bui
Tho Tran Huu
Duy-Tung Dinh
T. Nguyen
Trong Nghia Hoang
304
5
0
27 Feb 2025
Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking
Workshop on Hyperspectral Image and Signal Processing (WHISPERS), 2024
Shaheer Mohamed
Tharindu Fernando
Sridha Sridharan
Peyman Moghadam
Clinton Fookes
ViT
375
1
0
26 Feb 2025
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Yingyu Liang
Zhizhou Sha
Zhenmei Shi
Zhao Song
Yufa Zhou
532
22
0
21 Feb 2025
RhythmFormer: Extracting Patterned rPPG Signals based on Periodic Sparse Attention
Pattern Recognition (Pattern Recogn.), 2024
Bochao Zou
Zizheng Guo
Jiansheng Chen
Junbao Zhuo
Weiran Huang
Huimin Ma
ViT AI4TS
309
1
0
21 Feb 2025
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
Yen-Siang Wu
Chi-Pin Huang
Fu-En Yang
Yu-Jie Wang
DiffM VGen
235
2
0
18 Feb 2025
Improving action segmentation via explicit similarity measurement
Kamel Aouaidjia
Wenhao Zhang
Aofan Li
Chongsheng Zhang
230
0
0
15 Feb 2025
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis
Amir Hosein Fadaei
M. Dehaqani
285
0
0
11 Feb 2025