arXiv: 2103.15691

ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

29 March 2021

Mostafa Dehghani

Cordelia Schmid

ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (3544★)

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,309 papers shown

EEA: Exploration-Exploitation Agent for Long Video Understanding

03 Dec 2025

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Guangliang Cheng

02 Dec 2025

Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

02 Dec 2025

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

01 Dec 2025

ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process

01 Dec 2025

db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

28 Nov 2025

GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models

27 Nov 2025

Beyond Real versus Fake Towards Intent-Aware Video Analysis

Baptiste Chopin

27 Nov 2025

Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

26 Nov 2025

Smooth regularization for efficient video recognition

Mahadev Satyanarayanan

25 Nov 2025

VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction

24 Nov 2025

Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

22 Nov 2025

DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams

Ginés Carreto Picón

Alexandros Iosifidis

21 Nov 2025

Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman

Nicolas Pugeault

20 Nov 2025

RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

19 Nov 2025

End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction

Nicholas Matsumoto

Ujjwal Pasupulety

...

Geoffrey A. Sonn

14 Nov 2025

RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation

Vladimir Despotovic

13 Nov 2025

Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs

11 Nov 2025

Modulo Video Recovery via Selective Spatiotemporal Vision Transformer

IEEE International Joint Conference on Neural Networks (IJCNN), 2025

09 Nov 2025

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

06 Nov 2025

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Mauro Orazio Drago

Pelinsu Celebi Balyemez

Dennis Pierantozzi

Danail Stoyanov

Mobarak I. Hoque

05 Nov 2025

SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake

Danail Stoyanov

Matthew J. Clarkson

Mobarak I. Hoque

05 Nov 2025

Dynamic Reflections: Probing Video Representations with Text Alignment

Leonidas Guibas

Viorica Patraucean

04 Nov 2025

Anatomically Constrained Transformers for Echocardiogram Analysis

Jeremy Slivnick

02 Nov 2025

Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Pattern Recognition (Pattern Recogn.), 2025

Yehna Kim

31 Oct 2025

Transformers from Compressed Representations

Juan Carlos León Alcázar

Mohammad Saatialsoruji

Alejandro Pardo

26 Oct 2025

AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions

Padmanabhan Jagannathan Prajesh

Kaliaperumal Ragunath

Bruce Rathgeber

Suresh Neethirajan

26 Oct 2025

Relieving the Over-Aggregating Effect in Graph Transformers

24 Oct 2025

Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection

...

Faezeh Marzbanrad

23 Oct 2025

X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning

22 Oct 2025

A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

21 Oct 2025

StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

17 Oct 2025

Chronos-2: From Univariate to Universal Forecasting

Abdul Fatir Ansari

Oleksandr Shchur

...

Huzefa Rangwala

Michael Bohlke-Schneider

17 Oct 2025

DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

Robert Sablatnig

17 Oct 2025

VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Dominick Reilly

Manish Kumar Govind

15 Oct 2025

State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

14 Oct 2025

SpikePool: Event-driven Spiking Transformer with Pooling Attention

Priyadarshini Panda

14 Oct 2025

DAWP: A framework for global observation forecasting via Data Assimilation and Weather Prediction in satellite observation space

13 Oct 2025

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

12 Oct 2025

TARO: Toward Semantically Rich Open-World Object Detection

10 Oct 2025

Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Alexander C. Jenke

Oliver Saldanha

...

Danail Stoyanov

Jakob N. Kather

Fiona Kolbinger

Stefanie Speidel

06 Oct 2025

Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition

Jonas Andersson

Anders Sjögren

06 Oct 2025

Brain-Language Model Alignment: Insights into the Platonic Hypothesis and Intermediate-Layer Advantage

Angela Lopez-Cardona

Sebastian Idesis

Mireia Masias Bruns

Ioannis Arapakis

03 Oct 2025

SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos

Luca A. Lanzendörfer

Florian Grötschla

Roger Wattenhofer

03 Oct 2025

Visual Odometry with Transformers

Duy-Kien Nguyen

Cees G. M. Snoek

Martin R. Oswald

02 Oct 2025

Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)

Christine Ahrends

Carlo Alberto Barbano

02 Oct 2025

Aligning Video Models with Human Social Judgments via Behavior-Guided Fine-Tuning

01 Oct 2025

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Saydul Akbar Murad

01 Oct 2025

VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

Abdelilah Aitrouga

Youssef Hmamouche

Amal El Fallah Seghrouchni

30 Sep 2025

UniVid: The Open-Source Unified Video Model

29 Sep 2025