Principles of Visual Tokens for Efficient Video Understanding

20 November 2024
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara

Papers citing "Principles of Visual Tokens for Efficient Video Understanding"

46 / 46 papers shown
LookupViT: Compressing visual information to a limited number of tokens
Rajat Koner
Gagan Jain
Prateek Jain
Volker Tresp
Sujoy Paul
159
15
0
17 Jul 2024
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
Joonmyung Choi
Sanghyeok Lee
Jaewon Chu
Minhyuk Choi
Hyunwoo J. Kim
MoMe, ViT
274
38
0
20 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM, VLM
315
318
0
11 Mar 2024
How does the primate brain combine generative and discriminative computations in vision?
Benjamin Peters
J. DiCarlo
Todd Gureckis
Ralf Haefner
Leyla Isik
...
Kimberly Stachenfeld
Zenna Tavares
Doris Y. Tsao
Ilker Yildirim
N. Kriegeskorte
210
8
0
11 Jan 2024
Watt For What: Rethinking Deep Learning's Energy-Performance Relationship
Shreyank N. Gowda
Xinyue Hao
Gen Li
Laura Sevilla-Lara
Shashank Narayana Gowda
HAI
165
17
0
10 Oct 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Computer Vision and Pattern Recognition (CVPR), 2023
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
292
106
0
04 Sep 2023
Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation
IEEE International Conference on Computer Vision (ICCV), 2023
Shuangrui Ding
Peisen Zhao
Xiaopeng Zhang
Rui Qian
H. Xiong
Qi Tian
ViT
177
26
0
08 Aug 2023
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
IEEE International Conference on Computer Vision (ICCV), 2023
Syed Talal Wasim
Muhammad Uzair Khattak
Muzammal Naseer
Salman Khan
M. Shah
Fahad Shahbaz Khan
ViT
216
27
0
13 Jul 2023
How can objects help action recognition?
Computer Vision and Pattern Recognition (CVPR), 2023
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
198
25
0
20 Jun 2023
Revisiting Token Pruning for Object Detection and Instance Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Yifei Liu
Mathias Gehrig
Nico Messikommer
Marco Cannici
Davide Scaramuzza
ViT, VLM
375
52
0
12 Jun 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
367
198
0
11 May 2023
AIM: Adapting Image Models for Efficient Video Action Recognition
International Conference on Learning Representations (ICLR), 2023
Taojiannan Yang
Yi Zhu
Yusheng Xie
Aston Zhang
Chong Chen
Mu Li
ViT
385
215
0
06 Feb 2023
Token Merging: Your ViT But Faster
International Conference on Learning Representations (ICLR), 2022
Daniel Bolya
Cheng-Yang Fu
Xiaoliang Dai
Peizhao Zhang
Christoph Feichtenhofer
Judy Hoffman
MoMe
376
701
0
17 Oct 2022
Expanding Language-Image Pretrained Models for General Video Recognition
European Conference on Computer Vision (ECCV), 2022
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM, CLIP, ViT
255
425
0
04 Aug 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Neural Information Processing Systems (NeurIPS), 2022
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Hongsheng Li
320
259
0
27 Jun 2022
Revisiting the "Video" in Video-Language Understanding
Computer Vision and Pattern Recognition (CVPR), 2022
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
193
200
0
03 Jun 2022
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink
David A. Patterson
Joseph E. Gonzalez
Urs Holzle
Quoc V. Le
Chen Liang
Lluís-Miquel Munguía
D. Rothchild
David R. So
Maud Texier
J. Dean
AI4CE
261
336
0
11 Apr 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Neural Information Processing Systems (NeurIPS), 2022
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
656
1,595
0
23 Mar 2022
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2022
Chao-Yuan Wu
Yanghao Li
K. Mangalam
Haoqi Fan
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
ViT
377
242
0
20 Jan 2022
AdaViT: Adaptive Tokens for Efficient Vision Transformer
Hongxu Yin
Arash Vahdat
J. Álvarez
Arun Mallya
Jan Kautz
Pavlo Molchanov
ViT
558
438
0
14 Dec 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
158
81
0
23 Nov 2021
Video Swin Transformer
Ze Liu
Jia Ning
Yue Cao
Yixuan Wei
Zheng Zhang
Stephen Lin
Han Hu
ViT
377
1,835
0
24 Jun 2021
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Michael S. Ryoo
A. Piergiovanni
Anurag Arnab
Mostafa Dehghani
A. Angelova
ViT
521
152
0
21 Jun 2021
Space-time Mixing Attention for Video Transformer
Neural Information Processing Systems (NeurIPS), 2021
Adrian Bulat
Juan-Manuel Perez-Rua
Swathikiran Sudhakaran
Brais Martínez
Georgios Tzimiropoulos
ViT
269
141
0
10 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Neural Information Processing Systems (NeurIPS), 2021
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
260
338
0
09 Jun 2021
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Neural Information Processing Systems (NeurIPS), 2021
Yongming Rao
Wenliang Zhao
Benlin Liu
Jiwen Lu
Jie Zhou
Cho-Jui Hsieh
ViT
397
900
0
03 Jun 2021
Multiscale Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
431
1,487
0
22 Apr 2021
ViViT: A Video Vision Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
458
2,651
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
IEEE International Conference on Computer Vision (ICCV), 2021
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
1.8K
27,905
0
25 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
International Conference on Machine Learning (ICML), 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
1.0K
2,596
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
705
472
0
01 Feb 2021
Training data-efficient image transformers & distillation through attention
International Conference on Machine Learning (ICML), 2020
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
621
8,168
0
23 Dec 2020
SMART Frame Selection for Action Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2020
Shreyank N. Gowda
Marcus Rohrbach
Laura Sevilla-Lara
217
162
0
19 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
1.3K
54,134
0
22 Oct 2020
X3D: Expanding Architectures for Efficient Video Recognition
Computer Vision and Pattern Recognition (CVPR), 2020
Christoph Feichtenhofer
360
1,209
0
09 Apr 2020
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2019
Laura Sevilla-Lara
Shengxin Cindy Zha
Zhicheng Yan
Vedanuj Goswami
Matt Feiszli
Lorenzo Torresani
237
82
0
19 Jul 2019
Pyramid Feature Attention Network for Saliency detection
Computer Vision and Pattern Recognition (CVPR), 2019
Ting Zhao
Xiangqian Wu
201
661
0
01 Mar 2019
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
520
3,809
0
10 Dec 2018
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
323
524
0
24 Apr 2018
Compressed Video Action Recognition
Chao-Yuan Wu
Manzil Zaheer
Hexiang Hu
R. Manmatha
Alex Smola
Philipp Krahenbuhl
318
347
0
02 Dec 2017
The "something something" video database for learning and evaluating visual common sense
IEEE International Conference on Computer Vision (ICCV), 2017
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
...
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
VLM
368
1,762
0
13 Jun 2017
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
2.8K
159,241
0
12 Jun 2017
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu
Chen Sun
David A. Ross
Carl Vondrick
C. Pantofaru
...
G. Toderici
Susanna Ricco
Rahul Sukthankar
Cordelia Schmid
Jitendra Malik
VGen
446
1,125
0
23 May 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
720
8,952
0
22 May 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
International Journal of Computer Vision (IJCV), 2016
Ramprasaath R. Selvaraju
Michael Cogswell
Abhishek Das
Ramakrishna Vedantam
Devi Parikh
Dhruv Batra
FAtt
896
23,895
0
07 Oct 2016
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro
Amir Zamir
M. Shah
CLIP, VGen
841
6,758
0
03 Dec 2012