ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,309 papers shown
EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang
Xiangyu Zhu
Bo Wang
Quan Chen
Peng Jiang
Zhen Lei
60
0
0
03 Dec 2025
Reasoning-Aware Multimodal Fusion for Hateful Video Detection
Shuonan Yang
Tailin Chen
Jiangbei Yue
Guangliang Cheng
Jianbo Jiao
Zeyu Fu
256
0
0
02 Dec 2025
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Junwon Lee
Juhan Nam
Jiyoung Lee
DiffM
VGen
107
0
0
02 Dec 2025
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
Zipeng Wang
Dan Xu
ViT
100
0
0
01 Dec 2025
ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process
Jongwon Sohn
Juhyeon Moon
Hyunjoon Jung
Jaewook Nam
132
0
0
01 Dec 2025
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen
Ke Hong
Tianchen Zhao
Ruiqi Xie
Zhenhua Zhu
X. Zhang
Yu Wang
MoE
108
0
0
28 Nov 2025
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
Bin Wang
Ruotong Hu
Wenqian Wang
W. Li
Mingliang Gao
Runmin Cong
Wei Zhang
VLM
124
0
0
27 Nov 2025
Beyond Real versus Fake Towards Intent-Aware Video Analysis
Saurabh Atreya
Nabyl Quignon
Baptiste Chopin
Abhijit Das
A. Dantcheva
AAML
80
0
0
27 Nov 2025
Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition
Baoli Sun
Y. X. R. Wang
Xinzhu Ma
Zhihui Wang
Kun Lu
Zhiyong Wang
190
0
0
26 Nov 2025
Smooth regularization for efficient video recognition
Gil Goldman
Raja Giryes
Mahadev Satyanarayanan
AI4TS
203
0
0
25 Nov 2025
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
Shaobo Wang
Tianle Niu
Runkang Yang
Deshan Liu
Xu He
Zichen Wen
Conghui He
Xuming Hu
Linfeng Zhang
VGen
194
0
0
24 Nov 2025
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Xiaohong Liu
Xiufeng Song
Huayu Zheng
Lei Bai
Xiaoming Liu
Guangtao Zhai
DiffM
140
0
0
22 Nov 2025
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Ginés Carreto Picón
Peng Yuan Zhou
Qi Zhang
Alexandros Iosifidis
AI4TS
196
0
0
21 Nov 2025
Flow and Depth Assisted Video Prediction with Latent Transformer
Eliyas Suleyman
Paul Henderson
Eksan Firkat
Nicolas Pugeault
146
0
0
20 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu
Di Fu
Jiaxing Zhang
Gong Yu
Jiayu Zheng
Xiaoling Hu
Dongdi Zhao
Feiyang Li
Chao Chen
Yong Cao
93
0
0
19 Nov 2025
End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction
Xi Li
Nicholas Matsumoto
Ujjwal Pasupulety
Atharva Deo
Cherine Yang
...
Jeanine Kim
Alvin C. Goh
C. Wagner
Geoffrey A. Sonn
A. Hung
86
0
0
14 Nov 2025
RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
Daniele Perlo
Vladimir Despotovic
Selma Boudissa
Sang-Yoon Kim
P. V. Nazarov
Yanrong Zhang
Max Wintermark
O. Keunen
94
0
0
13 Nov 2025
Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs
Yuezhe Yang
Yiyue Guo
Wenjie Cai
Qingqing Ruan
Siying Wang
Xingbo Dong
Zhe Jin
Yong Dai
116
0
0
11 Nov 2025
Modulo Video Recovery via Selective Spatiotemporal Vision Transformer
IEEE International Joint Conference on Neural Networks (IJCNN), 2025
Tianyu Geng
Feng Ji
Wee Peng Tay
97
0
0
09 Nov 2025
Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
Nicholas Babey
Tiffany Gu
Yiheng Li
Cristian Meo
Kevin Zhu
108
0
0
06 Nov 2025
SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Mauro Orazio Drago
Luca Carlini
Pelinsu Celebi Balyemez
Dennis Pierantozzi
Chiara Lena
Cesare Hassan
Danail Stoyanov
Elena De Momi
Sophia Bano
Mobarak I. Hoque
212
0
0
05 Nov 2025
SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Shreyas C. Dhake
Jiayuan Huang
Runlong He
Danyal Z. Khan
E. Mazomenos
Sophia Bano
Hani J. Marcus
Danail Stoyanov
Matthew J. Clarkson
Mobarak I. Hoque
64
0
0
05 Nov 2025
Dynamic Reflections: Probing Video Representations with Text Alignment
Tyler Zhu
Tengda Han
Leonidas Guibas
Viorica Patraucean
M. Ovsjanikov
VGen
249
0
0
04 Nov 2025
Anatomically Constrained Transformers for Echocardiogram Analysis
A. Thorley
A. Chartsias
Jordan Strom
Jeremy Slivnick
D. Kotecha
Alberto Gómez
Yanfu Zhang
ViT
MedIm
116
0
0
02 Nov 2025
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Pattern Recognition (Pattern Recogn.), 2025
Yehna Kim
Y. Kim
Seong-Whan Lee
VLM
120
0
0
31 Oct 2025
Transformers from Compressed Representations
Juan Carlos León Alcázar
Mattia Soldan
Mohammad Saatialsoruji
Alejandro Pardo
Hani Itani
Juan C. Pérez
Bernard Ghanem
132
0
0
26 Oct 2025
AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
Padmanabhan Jagannathan Prajesh
Kaliaperumal Ragunath
Miriam Gordon
Bruce Rathgeber
Suresh Neethirajan
60
0
0
26 Oct 2025
Relieving the Over-Aggregating Effect in Graph Transformers
Junshu Sun
Wanxing Chang
Chenxue Yang
Qingming Huang
Shuhui Wang
132
0
0
24 Oct 2025
Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
Talha Ilyas
Duong Nhu
Allison Thomas
Arie Levin
Lim Wei Yap
...
Maya Reddy
Euan Wallace
Wenlong Cheng
Zongyuan Ge
Faezeh Marzbanrad
102
0
0
23 Oct 2025
X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Yunzhe Wang
Soham Hans
Volkan Ustun
EgoV
206
0
0
22 Oct 2025
A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Peiqin Zhuang
Wenlong Zhang
Yichao Wu
Ding Liang
Luping Zhou
Yali Wang
Wanli Ouyang
199
0
0
21 Oct 2025
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Nyle Siddiqui
Rohit Gupta
S. Swetha
Mubarak Shah
152
0
0
17 Oct 2025
Chronos-2: From Univariate to Universal Forecasting
Abdul Fatir Ansari
Oleksandr Shchur
Jaris Küken
Andreas Auer
Boran Han
...
Hao Wang
Huzefa Rangwala
George Karypis
Yuyang Wang
Michael Bohlke-Schneider
AI4TS
BDL
233
6
0
17 Oct 2025
DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
Tingyu Lin
Armin Dadras
Florian Kleber
Robert Sablatnig
VGen
89
0
0
17 Oct 2025
VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models
Dominick Reilly
Manish Kumar Govind
Le Xue
Srijan Das
VLM
148
0
0
15 Oct 2025
State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
Jiahuan Zhou
Kai Zhu
Zhenyu Cui
Zichen Liu
Xu Zou
Gang Hua
88
1
0
14 Oct 2025
SpikePool: Event-driven Spiking Transformer with Pooling Attention
Donghyun Lee
Alex Sima
Yuhang Li
Panos Stinis
Priyadarshini Panda
88
0
0
14 Oct 2025
DAWP: A framework for global observation forecasting via Data Assimilation and Weather Prediction in satellite observation space
Junchao Gong
Jingyi Xu
Ben Fei
Zhangrui Li
W. Zhang
Kun Chen
Wanghan Xu
Weidong Yang
Xiaokang Yang
Lei Bai
120
0
0
13 Oct 2025
Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans
Theo Di Piazza
Carole Lazarus
O. Nempont
L. Boussel
235
0
0
12 Oct 2025
TARO: Toward Semantically Rich Open-World Object Detection
Yuchen Zhang
Yao Lu
Johannes Betz
57
0
0
10 Oct 2025
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
Max Kirchner
Hanna Hoffmann
Alexander C. Jenke
Oliver Saldanha
Kevin Pfeiffer
...
Danail Stoyanov
Jakob N. Kather
Fiona Kolbinger
S. Bodenstedt
Stefanie Speidel
FedML
169
0
0
06 Oct 2025
Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition
Koen Vellenga
H. Steinhauer
Jonas Andersson
Anders Sjögren
139
0
0
06 Oct 2025
Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage
Angela Lopez-Cardona
Sebastian Idesis
Mireia Masias Bruns
Sergi Abadal
Ioannis Arapakis
104
0
0
03 Oct 2025
SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
Amir Dellali
Luca A. Lanzendörfer
Florian Grötschla
Roger Wattenhofer
VGen
116
0
0
03 Oct 2025
Visual Odometry with Transformers
V. Yugay
Duy-Kien Nguyen
Theo Gevers
Cees G. M. Snoek
Martin R. Oswald
144
1
0
02 Oct 2025
Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
Robert Scholz
Kunal Bagga
Christine Ahrends
Carlo Alberto Barbano
93
0
0
02 Oct 2025
Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning
Kathy Garcia
Leyla Isik
109
0
0
01 Oct 2025
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal
Ankit Ghimire
Saydul Akbar Murad
Nick Rahimi
142
0
0
01 Oct 2025
VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing
Abdelilah Aitrouga
Youssef Hmamouche
Amal El Fallah Seghrouchni
VGen
214
0
0
30 Sep 2025
UniVid: The Open-Source Unified Video Model
Jiabin Luo
Junhui Lin
Zeyu Zhang
Biao Wu
Meng Fang
Ling-Hao Chen
Hao Tang
VGen
276
7
0
29 Sep 2025