2103.15691
Cited By
v1
v2 (latest)
ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
HuggingFace (3 upvotes)
Github (3544★)
Papers citing
"ViViT: A Video Vision Transformer"
50 / 1,306 papers shown
Title
SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Guoxin Zang
Xue Li
Donglin Di
Lanshun Nie
Dechen Zhan
Yang Song
Lei Fan
VLM
264
1
0
10 Jul 2025
Depth-Sequence Transformer (DST) for Segment-Specific ICA Calcification Mapping on Non-Contrast CT
Xiangjian Hou
Ebru Yaman Akcicek
Xin Wang
Kazem Hashemizadeh
Scott Mcnally
Chun Yuan
Xiaodong Ma
144
0
0
10 Jul 2025
Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation
Szymon Płotka
Gizem Mert
Maciej Chrabaszcz
Ewa Szczurek
Arkadiusz Sitek
Mamba
MoE
212
1
0
08 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter
Ibne Farabi Shihab
Anuj Sharma
VLM
273
2
0
02 Jul 2025
Video-Guided Text-to-Music Generation Using Public Domain Movie Collections
Haven Kim
Cheng-i Wang
Weihan Xu
Julian McAuley
Hao-Wen Dong
VGen
221
4
0
01 Jul 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
356
5
0
01 Jul 2025
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam
Vincent Tao Hu
Bjorn Ommer
VLM
239
2
0
28 Jun 2025
TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Zhongbin Guo
Yuhao Wang
Ping Jian
Chengzhi Li
Xinyue Chen
Zhen Yang
Ertai E
231
0
0
23 Jun 2025
Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis
Varun Mannam
Zhenyu Shi
209
0
0
17 Jun 2025
Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors
Ziteng Li
Malte Kuhlmann
Ilana Nisky
Nicolás Navarro-Guerrero
89
0
0
17 Jun 2025
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Computer Vision and Pattern Recognition (CVPR), 2025
Darryl Ho
Samuel Madden
AI4TS
174
0
0
14 Jun 2025
FAME: A Lightweight Spatio-Temporal Network for Model Attribution of Face-Swap Deepfakes
Expert systems with applications (ESWA), 2025
Wasim Ahmad
Yan-Tsung Peng
Yuan-Hao Chang
CVBM
199
4
0
13 Jun 2025
An Effective End-to-End Solution for Multimodal Action Recognition
International Conference on Pattern Recognition (ICPR), 2025
Songping Wang
Xiantao Hu
Yueming Lyu
Caifeng Shan
215
2
0
11 Jun 2025
Abstraction-Based Proof Production in Formal Verification of Neural Networks
Y. Elboher
Omri Isac
Guy Katz
Tobias Ladner
Haoze Wu
184
2
0
11 Jun 2025
Synthetic Human Action Video Data Generation with Pose Transfer
Vaclav Knapp
Matyas Bohacek
194
1
0
11 Jun 2025
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding
Zhiyi Zhu
Xiaoyu Wu
Zihao Liu
Linlin Yang
214
0
0
10 Jun 2025
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen
Siran Chen
Kunchang Li
Qinglin Xu
Yu Qiao
Yali Wang
VOS
283
7
0
09 Jun 2025
Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM
Olivier Papillon
Rafik Goubran
James Green
Julien Larivière-Chartier
Caitlin Higginson
Frank Knoefel
Rébecca Robillard
154
0
0
07 Jun 2025
Bridging Perspectives: A Survey on Cross-view Collaborative Intelligence with Egocentric-Exocentric Vision
Yuping He
Yifei Huang
Guo Chen
Lidong Lu
Baoqi Pei
Jilan Xu
Tong Lu
Yoichi Sato
EgoV
357
2
0
06 Jun 2025
Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Yong Sun
Yipeng Wang
Junyu Shi
Zhiyuan Zhang
Yanmei Xiao
Lei Zhu
Manxi Jiang
Qiang Nie
VGen
186
2
0
05 Jun 2025
Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks
Jubayer Ahmed Bhuiyan Shawon
H. Mahmud
Kamrul Hasan
120
0
0
04 Jun 2025
Video, How Do Your Tokens Merge?
Sam Pollard
Michael Wray
ViT
MoMe
229
1
0
04 Jun 2025
FaceSleuth-R: Adaptive Orientation-Aware Attention for Robust Micro-Expression Recognition
Linquan Wu
Tianxiang Jiang
Haoyu Yang
Wenhao Duan
Shaochao Lin
Zixuan Wang
Yini Fang
Jacky Keung
277
0
0
03 Jun 2025
Clinical Annotations for Automatic Stuttering Severity Assessment
Ana Rita Valente
Rufael Marew
Hawau Olamide Toyin
Hamdan Al-Ali
Anelise Bohnen
Inma Becerra
Elsa Marta Soares
Goncalo Leal
Hanan Aldarmaki
89
0
0
31 May 2025
DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics
Eran Bamani Beeri
Eden Nissinman
A. Sintov
141
0
0
30 May 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
Jieyu Zhang
Mohammadreza Salehi
Ziqi Gao
Vishnu Iyengar
Norimasa Kobori
Quan Kong
Ranjay Krishna
331
2
0
29 May 2025
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Fuwen Luo
Shengfeng Lou
C. L. Philip Chen
Ziyue Wang
Chenliang Li
...
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Wenshu Fan
AI4TS
LRM
216
3
0
27 May 2025
Rotary Masked Autoencoders are Versatile Learners
Uros Zivanovic
Serafina Di Gioia
Andre Scaffidi
Martín de los Rios
Gabriella Contardo
R. Trotta
248
0
0
26 May 2025
CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge
Gabriele Lagani
Fabrizio Falchi
Claudio Gennaro
Giuseppe Amato
146
1
0
26 May 2025
Multi-modal brain encoding models for multi-modal stimuli
International Conference on Learning Representations (ICLR), 2025
R. Mamidi
Khushbu Pahwa
Mounika Marreddy
Maneesh Singh
Subba Reddy Oota
Bapi S. Raju
152
8
0
26 May 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Tong Lu
Baotian Hu
Min Zhang
OffRL
LRM
383
7
0
25 May 2025
Advancing Video Self-Supervised Learning via Image Foundation Models
Pattern Recognition Letters (Pattern Recogn. Lett.), 2025
Jingwei Wu
Zhewei Huang
Chang Liu
172
0
0
25 May 2025
Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Ping Li
Jianan Ni
Bo Pang
AAML
378
0
0
23 May 2025
Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
Damith Chamalke Senadeera
Xiaoyun Yang
Shibo Li
Muhammad Awais
Dimitrios Kollias
Gregory G. Slabaugh
Mamba
161
1
0
23 May 2025
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
Vignesh Gopinathan
Urs Zimmermann
Michael Arnold
Matthias Rottmann
186
0
0
22 May 2025
Time to Embed: Unlocking Foundation Models for Time Series with Channel Descriptions
Utsav Dutta
Sina Khoshfetrat Pakazad
Henrik Ohlsson
AI4TS
AIFin
309
0
0
20 May 2025
This Time is Different: An Observability Perspective on Time Series Foundation Models
Ben Cohen
Emaad Khwaja
Youssef Doubli
Salahidine Lemaachi
Chris Lettieri
...
Zongzhe Xu
Viktoriya Zhukova
David Asker
Ameet Talwalkar
Othmane Abou-Amal
AI4TS
AI4CE
423
11
0
20 May 2025
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
Haiquan Wen
Yiwei He
Zhenglin Huang
Tianxiao Li
Zihan Yu
Xingru Huang
Lu Qi
Baoyuan Wu
Xuelong Li
Guangliang Cheng
VGen
427
4
0
19 May 2025
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
Shihao Zou
Qingfeng Li
Wei Ji
Jingjing Li
Yongkui Yang
Guoqi Li
Chao Dong
314
1
0
15 May 2025
A Comparative Study of Human Activity Recognition: Motion, Tactile, and multi-modal Approaches
Valerio Belcamino
Nhat Minh Dinh Le
Quan Khanh Luu
Alessandro Carfì
V. A. Ho
Fulvio Mastrogiovanni
236
1
0
13 May 2025
TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks
Kutay Ertürk
Furkan Altınışık
İrem Sarıaltın
Ömer Nezih Gerek
SLR
445
1
0
11 May 2025
Decentralized LoRA Augmented Transformer with Context-aware Multi-scale Feature Learning for Secured Eye Diagnosis
Md. Naimur Asif Borno
Md Sakib Hossain Shovon
MD Hanif Sikder
Iffat Firozy Rimi
Tahani Jaser Alahmadi
Mohammad Ali Moni
MedIm
144
0
0
11 May 2025
Apple: Toward General Active Perception via Reinforcement Learning
Tim Schneider
Cristiana de Farias
Roberto Calandra
Lawrence Yunliang Chen
Jan Peters
919
1
0
09 May 2025
Let Humanoids Hike! Integrative Skill Development on Complex Trails
Computer Vision and Pattern Recognition (CVPR), 2025
Kwan-Yee Lin
Stella X. Yu
308
3
0
09 May 2025
Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
IEEE Wireless Communications Letters (WCL), 2025
Junhe Zhang
Wanli Ni
Pengwei Wang
Dongyu Wang
231
0
0
06 May 2025
Artificial Behavior Intelligence: Technology, Challenges, and Future Directions
International Conference on Human System Interaction (HSI), 2025
Kanghyun Jo
Jehwan Choi
Kwanho Kim
Seongmin Kim
Duy-Linh Nguyen
Xuan-Thuy Vo
Adri Priadana
Tien-Dat Tran
AI4CE
313
0
0
06 May 2025
Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
Hao Xu
Arbind Agrahari Baniya
Sam Well
Mohamed Reda Bouadjenek
Richard Dazeley
S. Aryal
AI4TS
283
3
0
06 May 2025
DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor
Wei-Ting Chen
Yu-Jiet Vong
Yi-Tsung Lee
Sy-Yen Kuo
Qiang Gao
Sizhuo Ma
Jian Wang
971
1
0
06 May 2025
seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models
Hafez Ghaemi
Eilif Muller
Shahab Bakhtiari
466
1
0
06 May 2025
Token Coordinated Prompt Attention is Needed for Visual Prompting
Zichen Liu
Xu Zou
Gang Hua
Jiahuan Zhou
505
3
0
05 May 2025