ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

29 March 2021

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,311 papers shown

Simplifying Traffic Anomaly Detection with Video Foundation Models

121

12 Jul 2025

PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process

145

12 Jul 2025

Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT

175

10 Jul 2025

SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

302

10 Jul 2025

Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation

234

08 Jul 2025

Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

303

02 Jul 2025

Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

258

01 Jul 2025

CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model

Computer Vision and Pattern Recognition (CVPR), 2025

409

01 Jul 2025

ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

293

28 Jun 2025

TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting

284

23 Jun 2025

Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors

Ziteng Li

Malte Kuhlmann

Ilana Nisky

Nicolás Navarro-Guerrero

125

17 Jun 2025

Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis

Varun Mannam

Zhenyu Shi

240

17 Jun 2025

DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification

Computer Vision and Pattern Recognition (CVPR), 2025

Darryl Ho

Samuel Madden

AI4TS

201

14 Jun 2025

FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes

Expert Systems with Applications (ESWA), 2025

226

13 Jun 2025

An Effective End-to-End Solution for Multimodal Action Recognition

International Conference on Pattern Recognition (ICPR), 2025

242

11 Jun 2025

Synthetic Human Action Video Data Generation with Pose Transfer

Vaclav Knapp

Matyas Bohacek

263

11 Jun 2025

Abstraction-Based Proof Production in Formal Verification of Neural Networks

213

11 Jun 2025

MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

227

10 Jun 2025

Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

382

09 Jun 2025

Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM

Olivier Papillon

Rafik Goubran

James Green

Julien Larivière-Chartier

Caitlin Higginson

Frank Knoefel

Rébecca Robillard

191

07 Jun 2025

Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision

392

06 Jun 2025

Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025

217

05 Jun 2025

Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon

H. Mahmud

Kamrul Hasan

155

04 Jun 2025

Video, How Do Your Tokens Merge?

Sam Pollard

Michael Wray

ViT MoMe

270

04 Jun 2025

FaceSleuth-R: Adaptive Orientation-Aware Attention for Robust Micro-Expression Recognition

336

03 Jun 2025

Clinical Annotations for Automatic Stuttering Severity Assessment

31 May 2025

DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics

Eran Bamani Beeri

Eden Nissinman

A. Sintov

156

30 May 2025

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

382

29 May 2025

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

...

249

27 May 2025

CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge

158

26 May 2025

Rotary Masked Autoencoders are Versatile Learners

278

26 May 2025

Multi-modal brain encoding models for multi-modal stimuli

International Conference on Learning Representations (ICLR), 2025

189

26 May 2025

VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization

399

25 May 2025

Advancing Video Self-Supervised Learning via Image Foundation Models

Pattern Recognition Letters (Pattern Recogn. Lett.), 2025

Jingwei Wu

Zhewei Huang

Chang Liu

218

25 May 2025

Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition

International Joint Conference on Artificial Intelligence (IJCAI), 2025

422

23 May 2025

Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

Damith Chamalke Senadeera

222

23 May 2025

Temporal Object Captioning for Street Scene Videos from LiDAR Tracks

244

22 May 2025

Time to Embed: Unlocking Foundation Models for Time Series with Channel Descriptions

Utsav Dutta

Sina Khoshfetrat Pakazad

Henrik Ohlsson

AI4TS AIFin

337

20 May 2025

This Time is Different: An Observability Perspective on Time Series Foundation Models

...

492

20 May 2025

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

551

19 May 2025

SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

359

15 May 2025

A Comparative Study of Human Activity Recognition: Motion, Tactile, and multi-modal Approaches

Fulvio Mastrogiovanni

265

13 May 2025

TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks

455

11 May 2025

Decentralized LoRA Augmented Transformer with Context-aware Multi-scale Feature Learning for Secured Eye Diagnosis

Md. Naimur Asif Borno

Md Sakib Hossain Shovon

MD Hanif Sikder

Iffat Firozy Rimi

Tahani Jaser Alahmadi

Mohammad Ali Moni

MedIm

156

11 May 2025

Apple: Toward General Active Perception via Reinforcement Learning

Tim Schneider

Cristiana de Farias

Roberto Calandra

Lawrence Yunliang Chen

Jan Peters

997

09 May 2025

Let Humanoids Hike! Integrative Skill Development on Complex Trails

Computer Vision and Pattern Recognition (CVPR), 2025

Kwan-Yee Lin

Stella X. Yu

349

09 May 2025

DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

1.0K

06 May 2025

Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges

Hao Xu

Arbind Agrahari Baniya

Sam Well

Mohamed Reda Bouadjenek

Richard Dazeley

S. Aryal

AI4TS

330

06 May 2025

seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Hafez Ghaemi

Eilif Muller

Shahab Bakhtiari

539

06 May 2025

Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks

IEEE Wireless Communications Letters (WCL), 2025

285

06 May 2025