ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

29 March 2021

Mostafa Dehghani

Cordelia Schmid

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,311 papers shown

VideoMamba: Spatio-Temporal Selective State Space Model

Hee-Seon Kim

Changick Kim

289

23

0

11 Jul 2024

Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

...

Ping Wang

323

6

0

11 Jul 2024

Toto: Time Series Optimized Transformer for Observability

Othmane Abou-Amal

267

15

0

10 Jul 2024

Video-to-Audio Generation with Hidden Alignment

Yu Gu

Dong Yu

284

24

0

10 Jul 2024

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian

Dahua Lin

237

8

0

09 Jul 2024

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

206

17

0

09 Jul 2024

C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition

409

11

0

08 Jul 2024

Improving ensemble extreme precipitation forecasts using generative artificial intelligence

David John Gagne II

258

6

0

05 Jul 2024

PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

245

12

0

03 Jul 2024

Semantically Guided Representation Learning For Action Anticipation

Federico Fontana

216

6

0

02 Jul 2024

Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

J. Sedmidubský

Fabrizio Falchi

Tomáš Rebok

226

0

0

02 Jul 2024

TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

Andre Sacilotti

Samuel Felipe dos Santos

Andrii Zadaianchuk

Jurandy Almeida

253

0

0

01 Jul 2024

Aeroengine performance prediction using a physical-embedded data-driven method

158

2

0

29 Jun 2024

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

Hao Fei

Meishan Zhang

277

66

0

27 Jun 2024

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Peng Sun

Qinghao Hu

Xun Chen

...

Tianwei Zhang

193

17

0

26 Jun 2024

Dark Transformer: A Video Transformer for Action Recognition in the Dark

230

0

0

25 Jun 2024

SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition

Han Zhang

238

7

0

21 Jun 2024

Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis

Md. Saiful Islam

Abdelrahman Abdelkader

...

Ruth B. Schneider

303

6

0

21 Jun 2024

Exploring the Impact of Hand Pose and Shadow on Hand-washing Action Recognition

138

2

0

19 Jun 2024

A Primal-Dual Framework for Transformers and Neural Networks

Tan M. Nguyen

Tam Nguyen

Andrea L. Bertozzi

Richard G. Baraniuk

Stanley J. Osher

196

16

0

19 Jun 2024

GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological Measurement

179

0

0

19 Jun 2024

ViLCo-Bench: VIdeo Language COntinual learning Benchmark

Neural Information Processing Systems (NeurIPS), 2024

Shohreh Deldari

Flora D. Salim

273

5

0

19 Jun 2024

Recognition of Dynamic Hand Gestures in Long Distance using a Web-Camera for Robot Guidance

Eran Bamani Beeri

167

0

0

18 Jun 2024

LieRE: Lie Rotational Positional Encodings

Sophie Ostmeier

Michael E. Moseley

Akshay S. Chaudhari

Akshay Chaudhari

354

1

0

14 Jun 2024

Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

209

20

0

13 Jun 2024

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

347

28

0

13 Jun 2024

Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

Zhengqi Zhao

Xiaohu Huang

Errui Ding

Jingdong Wang

Xinggang Wang

Wenyu Liu

172

2

0

13 Jun 2024

Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

301

17

0

12 Jun 2024

Image and Video Tokenization with Binary Spherical Quantization

Philipp Krahenbuhl

263

59

0

11 Jun 2024

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

Yu-Gang Jiang

295

23

0

10 Jun 2024

Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Jay Zhangjie Wu

Cong-Duy Nguyen

583

26

1

09 Jun 2024

MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome

Biomedical Signal Processing and Control (BSPC), 2024

344

0

0

07 Jun 2024

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Xiaoqiang Huang

646

31

0

06 Jun 2024

FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Andrew Gilbert

333

2

0

05 Jun 2024

Population Transformer: Learning Population-level Representations of Neural Activity

Christopher Wang

Sabera Talukder

Vighnesh Subramaniam

Saraswati Soedarmadji

Boris Katz

412

21

0

05 Jun 2024

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

322

76

0

03 Jun 2024

RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model

Halil Ibrahim Aysel

293

26

0

02 Jun 2024

Exploiting Frequency Correlation for Hyperspectral Image Reconstruction

437

2

0

02 Jun 2024

DroneVis: Versatile Computer Vision Library for Drones

221

2

0

01 Jun 2024

MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

230

4

0

28 May 2024

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

173

14

0

28 May 2024

Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions

271

1

0

28 May 2024

Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception

272

1

0

26 May 2024

Planted: a dataset for planted forest identification from multi-satellite time series

L. M. Pazos-Outón

Cristina Nader Vasconcelos

200

8

0

24 May 2024

ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

Cihang Xie

216

2

0

24 May 2024

Enhanced Spatiotemporal Prediction Using Physical-guided And Frequency-enhanced Recurrent Neural Networks

Bo Xu

219

3

0

23 May 2024

Attending to Topological Spaces: The Cellular Transformer

Rubén Ballester

Pablo Hernández-García

Claudio Battiloro

Carles Casacuberta

Sergio Escalera

Pavlo Vasylenko

323

6

0

23 May 2024

Scaling-laws for Large Time-series Models

Thomas D. P. Edwards

Benjamin Dan Wandelt

258

16

0

22 May 2024

From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Muhammad Bilal Shaikh

Syed Mohammed Shamsul Islam

347

30

0

22 May 2024

BIMM: Brain Inspired Masked Modeling for Video Representation Learning

243

0

0

21 May 2024
