Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.13915
Cited By
An Image is Worth 16x16 Words, What is a Video Worth?
25 March 2021
Gilad Sharir
Asaf Noy
Lihi Zelnik-Manor
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"An Image is Worth 16x16 Words, What is a Video Worth?"
14 / 14 papers shown
Title
The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction
Tom Sander
Moritz Tenthoff
Kay Wohlfarth
Christian Wöhler
19
0
0
08 May 2025
Position: Foundation Models Need Digital Twin Representations
Yiqing Shen
Hao Ding
Lalithkumar Seenivasan
Tianmin Shu
Mathias Unberath
AI4CE
29
0
0
01 May 2025
Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis
Amir Hosein Fadaei
M. Dehaqani
36
0
0
11 Feb 2025
TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Shakeeb Murtaza
Soufiane Belharbi
M. Pedersoli
Eric Granger
WSOL
VLM
86
1
0
22 Jan 2025
Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh
Umer Ahmed
Hamza Khan
M. Zia
Quoc-Huy Tran
VLM
77
0
0
04 Dec 2024
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
Y. Guo
Faizan Siddiqui
Yang Zhao
Rama Chellappa
Shao-Yuan Lo
LRM
24
2
0
31 Aug 2024
MPCFormer: fast, performant and private Transformer inference with MPC
Dacheng Li
Rulin Shao
Hongyi Wang
Han Guo
Eric P. Xing
Haotong Zhang
9
77
0
02 Nov 2022
Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation
Sebastian Lutz
R. Blythman
Koustav Ghosal
Matthew Moynihan
C. Simms
A. Smolic
ViT
15
15
0
07 Aug 2022
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Kunchang Li
Yali Wang
Junhao Zhang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
142
360
0
24 Jan 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li
Yali Wang
Peng Gao
Guanglu Song
Yu Liu
Hongsheng Li
Yu Qiao
ViT
12
235
0
12 Jan 2022
Self-supervised Video Transformer
Kanchana Ranasinghe
Muzammal Naseer
Salman Khan
F. Khan
Michael S. Ryoo
ViT
16
84
0
02 Dec 2021
ATISS: Autoregressive Transformers for Indoor Scene Synthesis
Despoina Paschalidou
Amlan Kar
Maria Shugrina
Karsten Kreis
Andreas Geiger
Sanja Fidler
3DV
ViT
25
147
0
07 Oct 2021
ImageNet-21K Pretraining for the Masses
T. Ridnik
Emanuel Ben-Baruch
Asaf Noy
Lihi Zelnik-Manor
SSeg
VLM
CLIP
154
676
0
22 Apr 2021
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
119
495
0
24 Apr 2018
1