2103.15691
ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
29 March 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
HuggingFace (3 upvotes)
Github (3544★)
Papers citing
"ViViT: A Video Vision Transformer"
50 / 1,311 papers shown
Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges, and Future Prospects
IEEE Access (IEEE Access), 2024
Awal Ahmed Fime
Saifuddin Mahmud
Arpita Das
Md. Sunzidul Islam
Hong-Hoon Kim
VGen
3DV
273
2
0
14 Sep 2024
TabMixer: Noninvasive Estimation of the Mean Pulmonary Artery Pressure via Imaging and Tabular Data Mixing
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024
Michal K. Grzeszczyk
Przemysław Korzeniowski
S. Alabed
Andrew J Swift
Tomasz Trzciński
Arkadiusz Sitek
138
2
0
11 Sep 2024
Data Collection-free Masked Video Modeling
European Conference on Computer Vision (ECCV), 2024
Yuchi Ishikawa
Masayoshi Kondo
Yoshimitsu Aoki
ViT
210
1
0
10 Sep 2024
Scaling Law Hypothesis for Multimodal Model
Qingyun Sun
Zhen Guo
160
0
0
10 Sep 2024
Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment
Keyne Oei
Amr Gomaa
Anna Maria Feit
João Belo
329
1
0
06 Sep 2024
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Yujie Wang
Shenhan Zhu
Fangcheng Fu
Xupeng Miao
Jie Zhang
Juan Zhu
Fan Hong
Yongbin Li
Bin Cui
156
0
0
05 Sep 2024
Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
Sohan Anisetty
James Hays
200
1
0
03 Sep 2024
3D-LSPTM: An Automatic Framework with 3D-Large-Scale Pretrained Model for Laryngeal Cancer Detection Using Laryngoscopic Videos
Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2024
Meiyu Qiu
Yongqian Li
Wenjun Huang
Haoyun Zhang
Weiping Zheng
Wenbin Lei
Xiaomao Fan
169
0
0
02 Sep 2024
ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers
International Joint Conference on Artificial Intelligence (IJCAI), 2024
Luoyu Mei
Shuai Wang
Yun Cheng
Ruofeng Liu
Zhimeng Yin
Wenchao Jiang
Shuai Wang
Wei Gong
225
10
0
02 Sep 2024
Learning to Discover Forgery Cues for Face Forgery Detection
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Jiahe Tian
Peng-Wen Chen
Cai Yu
Xiaomeng Fu
Xi Wang
Jiao Dai
Jizhong Han
CVBM
AAML
262
18
0
02 Sep 2024
Towards Student Actions in Classroom Scenes: New Dataset and Baseline
IEEE Transactions on Multimedia (IEEE TMM), 2024
Zhuolin Tan
Chenqiang Gao
Anyong Qin
Ruixin Chen
Tiecheng Song
Feng Yang
Deyu Meng
302
1
0
02 Sep 2024
RoboMNIST: A Multimodal Dataset for Multi-Robot Activity Recognition Using WiFi Sensing, Video, and Audio
Scientific Data (Sci Data), 2024
Kian Behzad
Rojin Zandi
Elaheh Motamedi
Hojjat Salehinejad
Milad Siami
175
11
0
29 Aug 2024
DEAR: Depth-Enhanced Action Recognition
Sadegh Rahmaniboldaji
Filip Rybansky
Quoc Vuong
Frank Guerin
Andrew Gilbert
130
1
0
28 Aug 2024
Fine-grained length controllable video captioning with ordinal embeddings
IEEE Access (IEEE Access), 2024
Tomoya Nitta
Takumi Fukuzawa
Toru Tamaki
316
0
0
27 Aug 2024
MMASD+: A Novel Dataset for Privacy-Preserving Behavior Analysis of Children with Autism Spectrum Disorder
Pavan Uttej Ravva
Behdokht Kiafar
Pinar Kullu
Jicheng Li
Anjana Bhat
R. Barmaki
229
1
0
27 Aug 2024
SurGen: Text-Guided Diffusion Model for Surgical Video Generation
Joseph Cho
Samuel Schmidgall
C. Zakka
Mrudang Mathur
Dhamanpreet Kaur
R. Shad
W. Hiesinger
VGen
MedIm
308
18
0
26 Aug 2024
Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2024
Bozheng Li
Mushui Liu
Gaoang Wang
Yunlong Yu
201
11
0
22 Aug 2024
EdgeNAT: Transformer for Efficient Edge Detection
European Conference on Artificial Intelligence (ECAI), 2024
Jinghuai Jie
Yan Guo
Guixing Wu
Junmin Wu
Baojian Hua
ViT
182
6
0
20 Aug 2024
ALS-HAR: Harnessing Wearable Ambient Light Sensors to Enhance IMU-based Human Activity Recogntion
L. Ray
Daniel Geißler
Mengxi Liu
Bo Zhou
Sungho Suh
Paul Lukowicz
167
0
0
18 Aug 2024
VrdONE: One-stage Video Visual Relation Detection
ACM Multimedia (MM), 2024
Xinjie Jiang
Chenxi Zheng
Xuemiao Xu
Bangzhen Liu
Weiying Zheng
Huaidong Zhang
Shengfeng He
VGen
VOS
215
5
0
18 Aug 2024
OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning
European Conference on Artificial Intelligence (ECAI), 2024
Mushui Liu
Bozheng Li
Yunlong Yu
VLM
240
14
0
12 Aug 2024
How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model
IEEE Transactions on Image Processing (TIP), 2024
Yuxin Zhu
Huiyu Duan
Kaiwei Zhang
Yucheng Zhu
Xilei Zhu
Long Teng
Xiongkuo Min
Guangtao Zhai
270
7
0
10 Aug 2024
MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning
Conference on Multimedia Information Processing and Retrieval (MIPR), 2024
Rex Liu
Xin Liu
264
1
0
08 Aug 2024
Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024
Shu Yang
Luyang Luo
Qiong Wang
Hao Chen
MedIm
165
23
0
07 Aug 2024
From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation
Xin Liu
Chao Hao
Zitong Yu
Huanjing Yue
Jingyu Yang
257
2
0
05 Aug 2024
BioMamba: A Pre-trained Biomedical Language Representation Model Leveraging Mamba
Ling Yue
Sixue Xing
Yingzhou Lu
Tianfan Fu
Mamba
AI4CE
233
11
0
05 Aug 2024
FE-Adapter: Adapting Image-based Emotion Classifiers to Videos
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2024
Shreyank N. Gowda
Boyan Gao
David A. Clifton
189
10
0
05 Aug 2024
VidModEx: Interpretable and Efficient Black Box Model Extraction for High-Dimensional Spaces
Somnath Sendhil Kumar
Yuvaraj Govindarajulu
Pavan Kulkarni
Manojkumar Somabhai Parmar
FAtt
203
1
0
04 Aug 2024
A Survey of Mamba
Shuwei Shi
Shibing Chu
Rui An
Wenqi Fan
Yuee Xie
Hui Liu
Yuanping Chen
Qing Li
AI4CE
508
75
0
02 Aug 2024
MPT-PAR: Mix-Parameters Transformer for Panoramic Activity Recognition
Wenqing Gan
Yaoyu Li
Jian Li
Zhangang Lin
ViT
261
1
0
01 Aug 2024
RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining
Hongtao Wu
Yijun Yang
Huihui Xu
Weiming Wang
Jinni Zhou
Lei Zhu
Mamba
217
61
0
31 Jul 2024
PEAR: Phrase-Based Hand-Object Interaction Anticipation
Zichen Zhang
Hongcheng Luo
Wei Zhai
N. A. Ushakov
Yu Kang
316
10
0
31 Jul 2024
Dynamic Gesture Recognition in Ultra-Range Distance for Effective Human-Robot Interaction
Eran Bamani Beeri
Eden Nissinman
A. Sintov
116
0
0
31 Jul 2024
SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
Yicheng Deng
Hideaki Hayashi
Hajime Nagahara
288
3
0
30 Jul 2024
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Neural Information Processing Systems (NeurIPS), 2024
Gagan Jain
Nidhi Hegde
Aditya Kusupati
Arsha Nagrani
Shyamal Buch
Prateek Jain
Anurag Arnab
Sujoy Paul
MoE
273
17
0
29 Jul 2024
Language-driven Grasp Detection with Mask-guided Attention
IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS), 2024
Tuan V. Vo
M. Vu
Baoru Huang
An Vuong
Ngan Le
T. Vo
Anh Nguyen
191
6
0
29 Jul 2024
MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili
ACM Multimedia (MM), 2024
Han Wang
Tan Rui Yang
Usman Naseem
Roy Ka-wei Lee
267
23
0
28 Jul 2024
Trajectory-aligned Space-time Tokens for Few-shot Action Recognition
Pulkit Kumar
Namitha Padmanabhan
Luke Luo
Sai Saketh Rambhatla
Abhinav Shrivastava
242
7
0
25 Jul 2024
OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
Debidatta Dwibedi
Y. Aytar
Jonathan Tompson
Andrew Zisserman
157
6
0
24 Jul 2024
Causal Understanding For Video Question Answering
Bhanu Prakash Reddy Guda
Tanmay Kulkarni
Adithya Sampath
Swarnashree Mysore Sathyendra
CML
277
0
0
23 Jul 2024
Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models
Thinesh Thiyakesan Ponbagavathi
Kunyu Peng
Alina Roitberg
280
2
0
22 Jul 2024
SIGMA: Sinkhorn-Guided Masked Video Modeling
Mohammadreza Salehi
Michael Dorkenwald
Fida Mohammad Thoker
E. Gavves
Cees G. M. Snoek
Yuki M. Asano
255
14
0
22 Jul 2024
Towards Robust Vision Transformer via Masked Adaptive Ensemble
Fudong Lin
Jiadong Lou
Xu Yuan
Nianfeng Tzeng
ViT
AAML
287
3
0
22 Jul 2024
A Comprehensive Review of Few-shot Action Recognition
Yuyang Wanyan
Xiaoshan Yang
Weiming Dong
Changsheng Xu
VLM
538
13
0
20 Jul 2024
Pose-guided multi-task video transformer for driver action recognition
Ricardo Pizarro
Roberto Valle
L. Bergasa
J. M. Buenaposada
Luis Baumela
ViT
194
1
0
18 Jul 2024
DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays
Xuhui Liu
Zhi Qiao
Runkun Liu
Hong Li
Juan Zhang
Xiantong Zhen
Zhen Qian
Baochang Zhang
MedIm
192
7
0
18 Jul 2024
Enhancing Temporal Action Localization: Advanced S6 Modeling with Recurrent Mechanism
Sangyoun Lee
Juho Jung
Changdae Oh
Sunghee Yun
223
6
0
18 Jul 2024
Retrieval-Enhanced Machine Learning: Synthesis and Opportunities
To Eun Kim
Alireza Salemi
Andrew Drozdov
Fernando Diaz
Hamed Zamani
367
10
0
17 Jul 2024
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
Ping Wang
Yulun Zhang
Lishun Wang
Xin Yuan
ViT
415
4
0
16 Jul 2024
Human-Centric Transformer for Domain Adaptive Action Recognition
Kun-Yu Lin
Jiaming Zhou
Wei-Shi Zheng
225
10
0
15 Jul 2024