
Space-time Mixing Attention for Video Transformer

Neural Information Processing Systems (NeurIPS), 2021

10 June 2021

Adrian Bulat

Juan-Manuel Perez-Rua

Swathikiran Sudhakaran

Brais Martínez

Georgios Tzimiropoulos

ViT


Papers citing "Space-time Mixing Attention for Video Transformer"

50 / 77 papers shown

Smooth regularization for efficient video recognition

Gil Goldman

Raja Giryes

Mahadev Satyanarayanan

AI4TS

305

25 Nov 2025

Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing

221

10 Sep 2025

SRVP: Strong Recollection Video Prediction Model Using Attention-Based Spatiotemporal Correlation Fusion

Yuseon Kim

Kyongseok Park

404

10 Apr 2025

Principles of Visual Tokens for Efficient Video Understanding

540

20 Nov 2024

FE-Adapter: Adapting Image-based Emotion Classifiers to Videos

IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2024

Shreyank N. Gowda

Boyan Gao

David A. Clifton

267

05 Aug 2024

PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

313

03 Jul 2024

Hybrid Feature Collaborative Reconstruction Network for Few-Shot Fine-Grained Image Classification

Shulei Qiu

Wanqi Yang

Ming Yang

288

02 Jul 2024

A Survey on Backbones for Deep Video Action Recognition

193

09 May 2024

Learning Correlation Structures for Vision Transformers

369

05 Apr 2024

OmniVid: A Generative Framework for Universal Video Understanding

Lu Yuan

Zuxuan Wu

Yu-Gang Jiang

VLM VGen

344

26 Mar 2024

Computer Vision for Primate Behavior Analysis in the Wild

...

519

29 Jan 2024

GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition

478

18 Jan 2024

Collaboratively Self-supervised Video Representation Learning for Action Recognition

IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024

516

15 Jan 2024

Motion Guided Token Compression for Efficient Masked Video Modeling

300

10 Jan 2024

Video Recognition in Portrait Mode

Mingfei Han

Linjie Yang

Xiaojie Jin

Jiashi Feng

Xiaojun Chang

Heng Wang

266

21 Dec 2023

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Computer Vision and Pattern Recognition (CVPR), 2023

367

04 Dec 2023

Learning Human Action Recognition Representations Without Real Humans

Neural Information Processing Systems (NeurIPS), 2023

369

10 Nov 2023

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Zuxuan Wu

298

08 Oct 2023

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

European Conference on Computer Vision (ECCV), 2023

Xinhao Li

Yuhan Zhu

Limin Wang

VLM

358

02 Oct 2023

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

IEEE International Conference on Computer Vision (ICCV), 2023

266

14 Sep 2023

COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

409

03 Sep 2023

Computation-efficient Deep Learning for Computer Vision: A Survey

Yulin Wang

Gao Huang

363

27 Aug 2023

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

ACM Multimedia (ACM MM), 2023

294

23 Aug 2023

Joint learning of images and videos with a single Vision Transformer

Shuki Shimizu

Toru Tamaki

ViT

211

21 Aug 2023

Temporally-Adaptive Models for Efficient Video Understanding

Ziwei Liu

243

10 Aug 2023

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

IEEE International Conference on Computer Vision (ICCV), 2023

253

08 Aug 2023

Multimodal Distillation for Egocentric Action Recognition

IEEE International Conference on Computer Vision (ICCV), 2023

Gorjan Radevski

Dusan Grujicic

Marie-Francine Moens

Matthew Blaschko

Tinne Tuytelaars

EgoV

420

14 Jul 2023

Free-Form Composition Networks for Egocentric Action Recognition

Yibing Zhan

Liang Ding

365

13 Jul 2023

Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

Neurocomputing, 2023

Thanh-Dat Truong

Khoa Luu

EgoV

447

25 May 2023

LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

195

05 May 2023

Efficient Video Action Detection with Token Dropout and Context Refinement

IEEE International Conference on Computer Vision (ICCV), 2023

Lei Chen

Zhan Tong

Yibing Song

Gangshan Wu

Limin Wang

361

17 Apr 2023

MC-ViViT: Multi-branch Classifier-ViViT to detect Mild Cognitive Impairment in older adults using facial videos

Expert Systems with Applications (ESWA), 2023

Jian Sun

H. H. Dodge

Mohammad H. Mahoor

372

11 Apr 2023

DIR-AS: Decoupling Individual Identification and Temporal Reasoning for Action Segmentation

Peiyao Wang

Haibin Ling

184

04 Apr 2023

AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation

Computer Vision and Pattern Recognition (CVPR), 2023

313

03 Apr 2023

SVT: Supertoken Video Transformer for Efficient Video Understanding

Madian Khabsa

366

01 Apr 2023

Streaming Video Model

Computer Vision and Pattern Recognition (CVPR), 2023

285

30 Mar 2023

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

Computer Vision and Pattern Recognition (CVPR), 2023

281

28 Mar 2023

PhysFormer++: Facial Video-based Physiological Measurement with SlowFast Temporal Difference Transformer

International Journal of Computer Vision (IJCV), 2023

Jingang Shi

280

127

07 Feb 2023

Optical Flow Estimation in 360° Videos: Dataset, Model and Application

Bin Duan

Keshav Bhandari

Gaowen Liu

Yan Yan

211

27 Jan 2023

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Computer Vision and Pattern Recognition (CVPR), 2023

301

26 Jan 2023

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

IEEE International Conference on Computer Vision (ICCV), 2022

390

12 Dec 2022

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning

Computer Vision and Pattern Recognition (CVPR), 2022

Zuxuan Wu

Lu Yuan

435

127

08 Dec 2022

Lightweight Structure-Aware Attention for Visual Understanding

International Journal of Computer Vision (IJCV), 2022

248

29 Nov 2022

EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

International Conference on Machine Learning (ICML), 2022

463

19 Nov 2022

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Yi Wang

Yu Qiao

268

172

17 Nov 2022

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

...

Yu Qiao

188

17 Nov 2022

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

Computer Vision and Pattern Recognition (CVPR), 2022

Lei Zhu

Carola-Bibiane Schönlieb

Angelica I Aviles-Rivero

307

13 Nov 2022

PatchBlender: A Motion Prior for Video Transformers

Gabriele Prato

Yale Song

Janarthanan Rajendran

225

11 Nov 2022

Linear Video Transformer with Feature Fixation

Zhen Qin

...

Yuchao Dai

249

15 Oct 2022

On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

311

15 Sep 2022