ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT
ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (3544★)

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,299 papers shown
Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
Baoli Sun
Y. X. R. Wang
Xinzhu Ma
Zhihui Wang
Kun Lu
Zhiyong Wang
146
0
0
26 Nov 2025
Smooth regularization for efficient video recognition
Gil Goldman
Raja Giryes
Mahadev Satyanarayanan
AI4TS
163
0
0
25 Nov 2025
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
Shaobo Wang
Tianle Niu
Runkang Yang
Deshan Liu
Xu He
Zichen Wen
Conghui He
Xuming Hu
Linfeng Zhang
VGen
166
0
0
24 Nov 2025
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Xiaohong Liu
Xiufeng Song
Huayu Zheng
Lei Bai
Xiaoming Liu
Guangtao Zhai
DiffM
104
0
0
22 Nov 2025
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Ginés Carreto Picón
Peng Yuan Zhou
Qi Zhang
Alexandros Iosifidis
AI4TS
164
0
0
21 Nov 2025
Flow and Depth Assisted Video Prediction with Latent Transformer
Eliyas Suleyman
Paul Henderson
Eksan Firkat
Nicolas Pugeault
74
0
0
20 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu
Di Fu
Jiaxing Zhang
Gong Yu
Jiayu Zheng
Xiaoling Hu
Dongdi Zhao
Feiyang Li
Chao Chen
Yong Cao
65
0
0
19 Nov 2025
End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction
Xi Li
Nicholas Matsumoto
Ujjwal Pasupulety
Atharva Deo
Cherine Yang
...
Jeanine Kim
Alvin C. Goh
C. Wagner
Geoffrey A. Sonn
A. Hung
58
0
0
14 Nov 2025
RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
Daniele Perlo
Vladimir Despotovic
Selma Boudissa
Sang-Yoon Kim
P. V. Nazarov
Yanrong Zhang
Max Wintermark
O. Keunen
56
0
0
13 Nov 2025
Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs
Yuezhe Yang
Yiyue Guo
Wenjie Cai
Qingqing Ruan
Siying Wang
Xingbo Dong
Zhe Jin
Yong Dai
100
0
0
11 Nov 2025
Modulo Video Recovery via Selective Spatiotemporal Vision Transformer
IEEE International Joint Conference on Neural Network (IJCNN), 2025
Tianyu Geng
Feng Ji
Wee Peng Tay
49
0
0
09 Nov 2025
Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
Nicholas Babey
Tiffany Gu
Yiheng Li
Cristian Meo
Kevin Zhu
56
0
0
06 Nov 2025
SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Shreyas C. Dhake
Jiayuan Huang
Runlong He
Danyal Z. Khan
E. Mazomenos
Sophia Bano
Hani J. Marcus
Danail Stoyanov
Matthew J. Clarkson
Mobarak I. Hoque
48
0
0
05 Nov 2025
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Mauro Orazio Drago
Luca Carlini
Pelinsu Celebi Balyemez
Dennis Pierantozzi
Chiara Lena
Cesare Hassan
Danail Stoyanov
Elena De Momi
Sophia Bano
Mobarak I. Hoque
128
0
0
05 Nov 2025
Dynamic Reflections: Probing Video Representations with Text Alignment
Tyler Zhu
Tengda Han
Leonidas Guibas
Viorica Patraucean
M. Ovsjanikov
VGen
217
0
0
04 Nov 2025
Anatomically Constrained Transformers for Echocardiogram Analysis
A. Thorley
A. Chartsias
Jordan Strom
Jeremy Slivnick
D. Kotecha
Alberto Gómez
Yanfu Zhang
ViTMedIm
84
0
0
02 Nov 2025
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Pattern Recognition (Pattern Recogn.), 2025
Yehna Kim
Y. Kim
Seong-Whan Lee
VLM
83
0
0
31 Oct 2025
Transformers from Compressed Representations
Juan Carlos León Alcázar
Mattia Soldan
Mohammad Saatialsoruji
Alejandro Pardo
Hani Itani
Juan C. Pérez
Bernard Ghanem
108
0
0
26 Oct 2025
AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
Padmanabhan Jagannathan Prajesh
Kaliaperumal Ragunath
Miriam Gordon
Bruce Rathgeber
Suresh Neethirajan
16
0
0
26 Oct 2025
Relieving the Over-Aggregating Effect in Graph Transformers
Junshu Sun
Wanxing Chang
Chenxue Yang
Qingming Huang
Shuhui Wang
104
0
0
24 Oct 2025
Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
Talha Ilyas
Duong Nhu
Allison Thomas
Arie Levin
Lim Wei Yap
...
Maya Reddy
Euan Wallace
Wenlong Cheng
Zongyuan Ge
Faezeh Marzbanrad
74
0
0
23 Oct 2025
X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Yunzhe Wang
Soham Hans
Volkan Ustun
EgoV
138
0
0
22 Oct 2025
A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Peiqin Zhuang
Wenlong Zhang
Yichao Wu
Ding Liang
Luping Zhou
Yali Wang
Wanli Ouyang
119
0
0
21 Oct 2025
Chronos-2: From Univariate to Universal Forecasting
Abdul Fatir Ansari
Oleksandr Shchur
Jaris Küken
Andreas Auer
Boran Han
...
Hao Wang
Huzefa Rangwala
George Karypis
Yuyang Wang
Michael Bohlke-Schneider
AI4TS
BDL
129
6
0
17 Oct 2025
DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
Tingyu Lin
Armin Dadras
Florian Kleber
Robert Sablatnig
VGen
49
0
0
17 Oct 2025
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Nyle Siddiqui
Rohit Gupta
S. Swetha
Mubarak Shah
132
0
0
17 Oct 2025
VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models
Dominick Reilly
Manish Kumar Govind
Le Xue
Srijan Das
VLM
84
0
0
15 Oct 2025
State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
Jiahuan Zhou
Kai Zhu
Zhenyu Cui
Zichen Liu
Xu Zou
Gang Hua
68
1
0
14 Oct 2025
SpikePool: Event-driven Spiking Transformer with Pooling Attention
Donghyun Lee
Alex Sima
Yuhang Li
Panos Stinis
Priyadarshini Panda
64
0
0
14 Oct 2025
DAWP: A framework for global observation forecasting via Data Assimilation and Weather Prediction in satellite observation space
Junchao Gong
Jingyi Xu
Ben Fei
Zhangrui Li
W. Zhang
Kun Chen
Wanghan Xu
Weidong Yang
Xiaokang Yang
Lei Bai
88
0
0
13 Oct 2025
Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans
Theo Di Piazza
Carole Lazarus
O. Nempont
L. Boussel
157
0
0
12 Oct 2025
TARO: Toward Semantically Rich Open-World Object Detection
Yuchen Zhang
Yao Lu
Johannes Betz
33
0
0
10 Oct 2025
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
Max Kirchner
Hanna Hoffmann
Alexander C. Jenke
Oliver Saldanha
Kevin Pfeiffer
...
Danail Stoyanov
Jakob N. Kather
Fiona Kolbinger
S. Bodenstedt
Stefanie Speidel
FedML
133
0
0
06 Oct 2025
Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition
Koen Vellenga
H. Steinhauer
Jonas Andersson
Anders Sjögren
103
0
0
06 Oct 2025
SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
Amir Dellali
Luca A. Lanzendörfer
Florian Grötschla
Roger Wattenhofer
VGen
84
0
0
03 Oct 2025
Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage
Angela Lopez-Cardona
Sebastian Idesis
Mireia Masias Bruns
Sergi Abadal
Ioannis Arapakis
84
0
0
03 Oct 2025
Visual Odometry with Transformers
V. Yugay
Duy-Kien Nguyen
Theo Gevers
Cees G. M. Snoek
Martin R. Oswald
68
1
0
02 Oct 2025
Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
Robert Scholz
Kunal Bagga
Christine Ahrends
Carlo Alberto Barbano
72
0
0
02 Oct 2025
Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning
Kathy Garcia
Leyla Isik
88
0
0
01 Oct 2025
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal
Ankit Ghimire
Saydul Akbar Murad
Nick Rahimi
98
0
0
01 Oct 2025
VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga
Youssef Hmamouche
Amal El Fallah Seghrouchni
VGen
113
0
0
30 Sep 2025
UniVid: The Open-Source Unified Video Model
Jiabin Luo
Junhui Lin
Zeyu Zhang
Biao Wu
Meng Fang
Ling-Hao Chen
Hao Tang
VGen
206
6
0
29 Sep 2025
FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning
Haonan Ge
Yiwei Wang
Kai-Wei Chang
Hang Wu
Yujun Cai
LRM
152
0
0
28 Sep 2025
Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis
Sai Varun Kodathala
Rakesh Vunnam
76
0
0
25 Sep 2025
Anatomically Constrained Transformers for Cardiac Amyloidosis Classification
Alexander Thorley
A. Chartsias
Jordan Strom
Roberto Lang
Jeremy Slivnick
Jamie O'Driscoll
Rajan Sharma
D. Kotecha
Yanfu Zhang
A. Gómez
ViTMedIm
84
1
0
24 Sep 2025
MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation
Tongshuai Wu
Chao Lu
Ze Song
Yunlong Lin
Sizhe Fan
Xuemei Chen
64
0
0
23 Sep 2025
History-Aware Visuomotor Policy Learning via Point Tracking
Jingjing Chen
Hongjie Fang
Chenxi Wang
Shiquan Wang
Cewu Lu
128
1
0
21 Sep 2025
MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
Binhua Huang
Nan Wang
Arjun Parakash
Soumyabrata Dev
CLIP
VLM
73
0
0
21 Sep 2025
FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection
Minji Heo
Simon S. Woo
129
1
0
20 Sep 2025
SolarCrossFormer: Improving day-ahead Solar Irradiance Forecasting by Integrating Satellite Imagery and Ground Sensors
Brazilian Symposium on Games and Digital Entertainment (SBGames), 2025
B. Schubnel
J. Simeunović
Corentin Tissier
P. Alet
R. Carrillo
BDL
326
0
0
19 Sep 2025