ViViT: A Video Vision Transformer

IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
    ViT
ArXiv (abs) | PDF | HTML | HuggingFace (3 upvotes) | GitHub (3544★)

Papers citing "ViViT: A Video Vision Transformer"

50 / 1,306 papers shown
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang
Xue Li
Donglin Di
Lanshun Nie
Dechen Zhan
Yang Song
Lei Fan
VLM
264
1
0
10 Jul 2025
Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT
Xiangjian Hou
Ebru Yaman Akcicek
Xin Wang
Kazem Hashemizadeh
Scott Mcnally
Chun Yuan
Xiaodong Ma
144
0
0
10 Jul 2025
Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation
Szymon Płotka
Gizem Mert
Maciej Chrabaszcz
Ewa Szczurek
Arkadiusz Sitek
Mamba, MoE
212
1
0
08 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter
Ibne Farabi Shihab
Anuj Sharma
VLM
273
2
0
02 Jul 2025
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
Haven Kim
Cheng-i Wang
Weihan Xu
Julian McAuley
Hao-Wen Dong
VGen
221
4
0
01 Jul 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
356
5
0
01 Jul 2025
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam
Vincent Tao Hu
Bjorn Ommer
VLM
239
2
0
28 Jun 2025
TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Zhongbin Guo
Yuhao Wang
Ping Jian
Chengzhi Li
Xinyue Chen
Zhen Yang
Ertai E
231
0
0
23 Jun 2025
Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis
Varun Mannam
Zhenyu Shi
209
0
0
17 Jun 2025
Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors
Ziteng Li
Malte Kuhlmann
Ilana Nisky
Nicolás Navarro-Guerrero
89
0
0
17 Jun 2025
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Computer Vision and Pattern Recognition (CVPR), 2025
Darryl Ho
Samuel Madden
AI4TS
174
0
0
14 Jun 2025
FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes
Expert Systems with Applications (ESWA), 2025
Wasim Ahmad
Yan-Tsung Peng
Yuan-Hao Chang
CVBM
199
4
0
13 Jun 2025
An Effective End-to-End Solution for Multimodal Action Recognition
International Conference on Pattern Recognition (ICPR), 2025
Songping Wang
Xiantao Hu
Yueming Lyu
Caifeng Shan
215
2
0
11 Jun 2025
Abstraction-Based Proof Production in Formal Verification of Neural Networks
Y. Elboher
Omri Isac
Guy Katz
Tobias Ladner
Haoze Wu
184
2
0
11 Jun 2025
Synthetic Human Action Video Data Generation with Pose Transfer
Vaclav Knapp
Matyas Bohacek
194
1
0
11 Jun 2025
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Zhiyi Zhu
Xiaoyu Wu
Zihao Liu
Linlin Yang
214
0
0
10 Jun 2025
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen
Siran Chen
Kunchang Li
Qinglin Xu
Yu Qiao
Yali Wang
VOS
283
7
0
09 Jun 2025
Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM
Olivier Papillon
Rafik Goubran
James Green
Julien Larivière-Chartier
Caitlin Higginson
Frank Knoefel
Rébecca Robillard
154
0
0
07 Jun 2025
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
Yuping He
Yifei Huang
Guo Chen
Lidong Lu
Baoqi Pei
Jilan Xu
Tong Lu
Yoichi Sato
EgoV
357
2
0
06 Jun 2025
Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Yong Sun
Yipeng Wang
Junyu Shi
Zhiyuan Zhang
Yanmei Xiao
Lei Zhu
Manxi Jiang
Qiang Nie
VGen
186
2
0
05 Jun 2025
Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks
Jubayer Ahmed Bhuiyan Shawon
H. Mahmud
Kamrul Hasan
120
0
0
04 Jun 2025
Video, How Do Your Tokens Merge?
Sam Pollard
Michael Wray
ViT, MoMe
229
1
0
04 Jun 2025
FaceSleuth-R: Adaptive Orientation-Aware Attention for Robust Micro-Expression Recognition
Linquan Wu
Tianxiang Jiang
Haoyu Yang
Wenhao Duan
Shaochao Lin
Zixuan Wang
Yini Fang
Jacky Keung
277
0
0
03 Jun 2025
Clinical Annotations for Automatic Stuttering Severity Assessment
Ana Rita Valente
Rufael Marew
Hawau Olamide Toyin
Hamdan Al-Ali
Anelise Bohnen
Inma Becerra
Elsa Marta Soares
Goncalo Leal
Hanan Aldarmaki
89
0
0
31 May 2025
DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics
Eran Bamani Beeri
Eden Nissinman
A. Sintov
141
0
0
30 May 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
Jieyu Zhang
Mohammadreza Salehi
Ziqi Gao
Vishnu Iyengar
Norimasa Kobori
Quan Kong
Ranjay Krishna
331
2
0
29 May 2025
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Fuwen Luo
Shengfeng Lou
C. L. Philip Chen
Ziyue Wang
Chenliang Li
...
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Wenshu Fan
AI4TS, LRM
216
3
0
27 May 2025
Rotary Masked Autoencoders are Versatile Learners
Uros Zivanovic
Serafina Di Gioia
Andre Scaffidi
Martín de los Rios
Gabriella Contardo
R. Trotta
248
0
0
26 May 2025
CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge
Gabriele Lagani
Fabrizio Falchi
Claudio Gennaro
Giuseppe Amato
146
1
0
26 May 2025
Multi-modal brain encoding models for multi-modal stimuli
International Conference on Learning Representations (ICLR), 2025
R. Mamidi
Khushbu Pahwa
Mounika Marreddy
Maneesh Singh
Subba Reddy Oota
Bapi S. Raju
152
8
0
26 May 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Tong Lu
Baotian Hu
Min Zhang
OffRL, LRM
383
7
0
25 May 2025
Advancing Video Self-Supervised Learning via Image Foundation Models
Pattern Recognition Letters (Pattern Recogn. Lett.), 2025
Jingwei Wu
Zhewei Huang
Chang Liu
172
0
0
25 May 2025
Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Ping Li
Jianan Ni
Bo Pang
AAML
378
0
0
23 May 2025
Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
Damith Chamalke Senadeera
Xiaoyun Yang
Shibo Li
Muhammad Awais
Dimitrios Kollias
Gregory G. Slabaugh
Mamba
161
1
0
23 May 2025
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
Vignesh Gopinathan
Urs Zimmermann
Michael Arnold
Matthias Rottmann
186
0
0
22 May 2025
Time to Embed: Unlocking Foundation Models for Time Series with Channel Descriptions
Utsav Dutta
Sina Khoshfetrat Pakazad
Henrik Ohlsson
AI4TS, AIFin
309
0
0
20 May 2025
This Time is Different: An Observability Perspective on Time Series Foundation Models
Ben Cohen
Emaad Khwaja
Youssef Doubli
Salahidine Lemaachi
Chris Lettieri
...
Zongzhe Xu
Viktoriya Zhukova
David Asker
Ameet Talwalkar
Othmane Abou-Amal
AI4TS, AI4CE
423
11
0
20 May 2025
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
Haiquan Wen
Yiwei He
Zhenglin Huang
Tianxiao Li
Zihan Yu
Xingru Huang
Lu Qi
Baoyuan Wu
Xuelong Li
Guangliang Cheng
VGen
427
4
0
19 May 2025
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Shihao Zou
Qingfeng Li
Wei Ji
Jingjing Li
Yongkui Yang
Guoqi Li
Chao Dong
314
1
0
15 May 2025
A Comparative Study of Human Activity Recognition: Motion, Tactile, and multi-modal Approaches
Valerio Belcamino
Nhat Minh Dinh Le
Quan Khanh Luu
Alessandro Carfì
V. A. Ho
Fulvio Mastrogiovanni
236
1
0
13 May 2025
TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks
Kutay Ertürk
Furkan Altınışık
İrem Sarıaltın
Ömer Nezih Gerek
SLR
445
1
0
11 May 2025
Decentralized LoRA Augmented Transformer with Context-aware Multi-scale Feature Learning for Secured Eye Diagnosis
Md. Naimur Asif Borno
Md Sakib Hossain Shovon
MD Hanif Sikder
Iffat Firozy Rimi
Tahani Jaser Alahmadi
Mohammad Ali Moni
MedIm
144
0
0
11 May 2025
Apple: Toward General Active Perception via Reinforcement Learning
Tim Schneider
Cristiana de Farias
Roberto Calandra
Lawrence Yunliang Chen
Jan Peters
919
1
0
09 May 2025
Let Humanoids Hike! Integrative Skill Development on Complex Trails
Computer Vision and Pattern Recognition (CVPR), 2025
Kwan-Yee Lin
Stella X. Yu
308
3
0
09 May 2025
Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
IEEE Wireless Communications Letters (WCL), 2025
Junhe Zhang
Wanli Ni
Pengwei Wang
Dongyu Wang
231
0
0
06 May 2025
Artificial Behavior Intelligence: Technology, Challenges, and Future Directions
International Conference on Human System Interaction (HSI), 2025
Kanghyun Jo
Jehwan Choi
Kwanho Kim
Seongmin Kim
Duy-Linh Nguyen
Xuan-Thuy Vo
Adri Priadana
Tien-Dat Tran
AI4CE
313
0
0
06 May 2025
Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
Hao Xu
Arbind Agrahari Baniya
Sam Well
Mohamed Reda Bouadjenek
Richard Dazeley
S. Aryal
AI4TS
283
3
0
06 May 2025
DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor
Wei-Ting Chen
Yu-Jiet Vong
Yi-Tsung Lee
Sy-Yen Kuo
Qiang Gao
Sizhuo Ma
Jian Wang
971
1
0
06 May 2025
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
Hafez Ghaemi
Eilif Muller
Shahab Bakhtiari
466
1
0
06 May 2025
Token Coordinated Prompt Attention is Needed for Visual Prompting
Zichen Liu
Xu Zou
Gang Hua
Jiahuan Zhou
505
3
0
05 May 2025