ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

29 March 2021

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,299 papers shown

Title
Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition Baoli Sun Y. X. R. Wang Xinzhu Ma Zhihui Wang Kun Lu Zhiyong Wang 146 0 0 26 Nov 2025
Smooth regularization for efficient video recognition Gil Goldman Raja Giryes Mahadev Satyanarayanan AI4TS 163 0 0 25 Nov 2025
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction Shaobo Wang Tianle Niu Runkang Yang Deshan Liu Xu He Zichen Wen Conghui He Xuming Hu Linfeng Zhang VGen 166 0 0 24 Nov 2025
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning Xiaohong Liu Xiufeng Song Huayu Zheng Lei Bai Xiaoming Liu Guangtao Zhai DiffM 104 0 0 22 Nov 2025
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams Ginés Carreto Picón Peng Yuan Zhou Qi Zhang Alexandros Iosifidis AI4TS 164 0 0 21 Nov 2025
Flow and Depth Assisted Video Prediction with Latent Transformer Eliyas Suleyman Paul Henderson Eksan Firkat Nicolas Pugeault 74 0 0 20 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification Meilong Xu Di Fu Jiaxing Zhang Gong Yu Jiayu Zheng Xiaoling Hu Dongdi Zhao Feiyang Li Chao Chen Yong Cao 65 0 0 19 Nov 2025
End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction Xi Li Nicholas Matsumoto Ujjwal Pasupulety Atharva Deo Cherine Yang ... Jeanine Kim Alvin C. Goh C. Wagner Geoffrey A. Sonn A. Hung 58 0 0 14 Nov 2025
RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation Daniele Perlo Vladimir Despotovic Selma Boudissa Sang-Yoon Kim P. V. Nazarov Yanrong Zhang Max Wintermark O. Keunen 56 0 0 13 Nov 2025
Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs Yuezhe Yang Yiyue Guo Wenjie Cai Qingqing Ruan Siying Wang Xingbo Dong Zhe Jin Yong Dai 100 0 0 11 Nov 2025
Modulo Video Recovery via Selective Spatiotemporal Vision Transformer IEEE International Joint Conference on Neural Network (IJCNN), 2025 Tianyu Geng Feng Ji Wee Peng Tay 49 0 0 09 Nov 2025
Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition Nicholas Babey Tiffany Gu Yiheng Li Cristian Meo Kevin Zhu 56 0 0 06 Nov 2025
SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention Shreyas C. Dhake Jiayuan Huang Runlong He Danyal Z. Khan E. Mazomenos Sophia Bano Hani J. Marcus Danail Stoyanov Matthew J. Clarkson Mobarak I. Hoque 48 0 0 05 Nov 2025
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding Mauro Orazio Drago Luca Carlini Pelinsu Celebi Balyemez Dennis Pierantozzi Chiara Lena Cesare Hassan Danail Stoyanov Elena De Momi Sophia Bano Mobarak I. Hoque 128 0 0 05 Nov 2025
Dynamic Reflections: Probing Video Representations with Text Alignment Tyler Zhu Tengda Han Leonidas Guibas Viorica Patraucean M. Ovsjanikov VGen 217 0 0 04 Nov 2025
Anatomically Constrained Transformers for Echocardiogram Analysis A. Thorley A. Chartsias Jordan Strom Jeremy Slivnick D. Kotecha Alberto Gómez Yanfu Zhang ViT MedIm 84 0 0 02 Nov 2025
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes Pattern Recognition (Pattern Recogn.), 2025 Yehna Kim Y. Kim Seong-Whan Lee VLM 83 0 0 31 Oct 2025
Transformers from Compressed Representations Juan Carlos León Alcázar Mattia Soldan Mohammad Saatialsoruji Alejandro Pardo Hani Itani Juan C. Pérez Bernard Ghanem 108 0 0 26 Oct 2025
AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions Padmanabhan Jagannathan Prajesh Kaliaperumal Ragunath Miriam Gordon Bruce Rathgeber Suresh Neethirajan 16 0 0 26 Oct 2025
Relieving the Over-Aggregating Effect in Graph Transformers Junshu Sun Wanxing Chang Chenxue Yang Qingming Huang Shuhui Wang 104 0 0 24 Oct 2025
Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection Talha Ilyas Duong Nhu Allison Thomas Arie Levin Lim Wei Yap ... Maya Reddy Euan Wallace Wenlong Cheng Zongyuan Ge Faezeh Marzbanrad 74 0 0 23 Oct 2025
X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning Yunzhe Wang Soham Hans Volkan Ustun EgoV 138 0 0 22 Oct 2025
A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition Peiqin Zhuang Wenlong Zhang Yichao Wu Ding Liang Luping Zhou Yali Wang Wanli Ouyang 119 0 0 21 Oct 2025
Chronos-2: From Univariate to Universal Forecasting Abdul Fatir Ansari Oleksandr Shchur Jaris Küken Andreas Auer Boran Han ... Hao Wang Huzefa Rangwala George Karypis Yuyang Wang Michael Bohlke-Schneider AI4TS BDL 129 6 0 17 Oct 2025
DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification Tingyu Lin Armin Dadras Florian Kleber Robert Sablatnig VGen 49 0 0 17 Oct 2025
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales Nyle Siddiqui Rohit Gupta S. Swetha Mubarak Shah 132 0 0 17 Oct 2025
VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models Dominick Reilly Manish Kumar Govind Le Xue Srijan Das VLM 84 0 0 15 Oct 2025
State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding Jiahuan Zhou Kai Zhu Zhenyu Cui Zichen Liu Xu Zou Gang Hua 68 1 0 14 Oct 2025
SpikePool: Event-driven Spiking Transformer with Pooling Attention Donghyun Lee Alex Sima Yuhang Li Panos Stinis Priyadarshini Panda 64 0 0 14 Oct 2025
DAWP: A framework for global observation forecasting via Data Assimilation and Weather Prediction in satellite observation space Junchao Gong Jingyi Xu Ben Fei Zhangrui Li W. Zhang Kun Chen Wanghan Xu Weidong Yang Xiaokang Yang Lei Bai 88 0 0 13 Oct 2025
Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans Theo Di Piazza Carole Lazarus O. Nempont L. Boussel 157 0 0 12 Oct 2025
TARO: Toward Semantically Rich Open-World Object Detection Yuchen Zhang Yao Lu Johannes Betz 33 0 0 10 Oct 2025
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge Max Kirchner Hanna Hoffmann Alexander C. Jenke Oliver Saldanha Kevin Pfeiffer ... Danail Stoyanov Jakob N. Kather Fiona Kolbinger S. Bodenstedt Stefanie Speidel FedML 133 0 0 06 Oct 2025
Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition Koen Vellenga H. Steinhauer Jonas Andersson Anders Sjögren 103 0 0 06 Oct 2025
SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos Amir Dellali Luca A. Lanzendörfer Florian Grötschla Roger Wattenhofer VGen 84 0 0 03 Oct 2025
Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage Angela Lopez-Cardona Sebastian Idesis Mireia Masias Bruns Sergi Abadal Ioannis Arapakis 84 0 0 03 Oct 2025
Visual Odometry with Transformers V. Yugay Duy-Kien Nguyen Theo Gevers Cees G. M. Snoek Martin R. Oswald 68 1 0 02 Oct 2025
Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report) Robert Scholz Kunal Bagga Christine Ahrends Carlo Alberto Barbano 72 0 0 02 Oct 2025
Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning Kathy Garcia Leyla Isik 88 0 0 01 Oct 2025
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency Ashim Dahal Ankit Ghimire Saydul Akbar Murad Nick Rahimi 98 0 0 01 Oct 2025
VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing Abdelilah Aitrouga Youssef Hmamouche Amal El Fallah Seghrouchni VGen 113 0 0 30 Sep 2025
UniVid: The Open-Source Unified Video Model Jiabin Luo Junhui Lin Zeyu Zhang Biao Wu Meng Fang Ling-Hao Chen Hao Tang VGen 206 6 0 29 Sep 2025
FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning Haonan Ge Yiwei Wang Kai-Wei Chang Hang Wu Yujun Cai LRM 152 0 0 28 Sep 2025
Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis Sai Varun Kodathala Rakesh Vunnam 76 0 0 25 Sep 2025
Anatomically Constrained Transformers for Cardiac Amyloidosis Classification Alexander Thorley A. Chartsias Jordan Strom Roberto Lang Jeremy Slivnick Jamie O'Driscoll Rajan Sharma D. Kotecha Yanfu Zhang A. Gómez ViT MedIm 84 1 0 24 Sep 2025
MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation Tongshuai Wu Chao Lu Ze Song Yunlong Lin Sizhe Fan Xuemei Chen 64 0 0 23 Sep 2025
History-Aware Visuomotor Policy Learning via Point Tracking Jingjing Chen Hongjie Fang Chenxi Wang Shiquan Wang Cewu Lu 128 1 0 21 Sep 2025
MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors Binhua Huang Nan Wang Arjun Parakash Soumyabrata Dev CLIP VLM 73 0 0 21 Sep 2025
FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection Minji Heo Simon S. Woo 129 1 0 20 Sep 2025
SolarCrossFormer: Improving day-ahead Solar Irradiance Forecasting by Integrating Satellite Imagery and Ground Sensors Brazilian Symposium on Games and Digital Entertainment (SBGames), 2025 B. Schubnel J. Simeunović Corentin Tissier P. Alet R. Carrillo BDL 326 0 0 19 Sep 2025