MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

2 December 2021
Yanghao Li
Chao-Yuan Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
    ViT
Papers citing "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

50 / 395 papers shown
Agent Attention: On the Integration of Softmax and Linear Attention
Dongchen Han
Tianzhu Ye
Yizeng Han
Zhuofan Xia
Siyuan Pan
Pengfei Wan
Shiji Song
Gao Huang
19
73
0
14 Dec 2023
Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost
Haolin Qin
Daquan Zhou
Tingfa Xu
Ziyang Bian
Jianan Li
19
9
0
14 Dec 2023
Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living
Dominick Reilly
Srijan Das
ViT
25
17
0
30 Nov 2023
Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models
Dong Li
Jiandong Jin
Yuhao Zhang
Yanlin Zhong
Yaoyang Wu
Lan Chen
Xiao Wang
Bin Luo
58
5
0
30 Nov 2023
Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation
A. Dasgupta
C. V. Jawahar
Karteek Alahari
TTA
VLM
11
10
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
37
1
0
30 Nov 2023
GeoDeformer: Geometric Deformable Transformer for Action Recognition
Jinhui Ye
Jiaming Zhou
Hui Xiong
Junwei Liang
ViT
13
1
0
29 Nov 2023
PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens
Sebastian Stapf
Tobias Bauernfeind
Marco Riboldi
ViT
13
1
0
29 Nov 2023
Object-based (yet Class-agnostic) Video Domain Adaptation
Dantong Niu
Amir Bar
Roei Herzig
Trevor Darrell
Anna Rohrbach
22
1
0
29 Nov 2023
Full-resolution MLPs Empower Medical Dense Prediction
Mingyuan Meng
Yuxin Xue
Da-wei Feng
Lei Bi
Jinman Kim
MedIm
15
4
0
28 Nov 2023
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
Xiaohan Ding
Yiyuan Zhang
Yixiao Ge
Sijie Zhao
Lin Song
Xiangyu Yue
Ying Shan
VLM
AI4TS
SSL
21
98
0
27 Nov 2023
Advancing Vision Transformers with Group-Mix Attention
Chongjian Ge
Xiaohan Ding
Zhan Tong
Li Yuan
Jiangliu Wang
Yibing Song
Ping Luo
112
16
0
26 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
27
3
0
25 Nov 2023
Window Attention is Bugged: How not to Interpolate Position Embeddings
Daniel Bolya
Chaitanya K. Ryali
Judy Hoffman
Christoph Feichtenhofer
24
10
0
09 Nov 2023
CCMR: High Resolution Optical Flow Estimation via Coarse-to-Fine Context-Guided Motion Reasoning
Azin Jahedi
Maximilian Luz
Marc Rivinius
Andrés Bruhn
16
2
0
05 Nov 2023
P-Age: Pexels Dataset for Robust Spatio-Temporal Apparent Age Classification
Abid Ali
Ashish Marisetty
François Brémond
27
6
0
04 Nov 2023
Scattering Vision Transformer: Spectral Mixing Matters
Badri N. Patro
Vijay Srinivas Agneeswaran
13
14
0
02 Nov 2023
Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition
Hamid Ahmadabadi
Omid Nejati Manzari
Ahmad Ayatollahi
14
7
0
02 Nov 2023
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
Jieming Cui
Ziren Gong
Baoxiong Jia
Siyuan Huang
Zilong Zheng
Jianzhu Ma
Yixin Zhu
20
3
0
01 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
15
14
0
31 Oct 2023
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Srijan Das
Tanmay Jain
Dominick Reilly
P. Balaji
Soumyajit Karmakar
Shyam Marjit
Xiang Li
Abhijit Das
Michael S. Ryoo
22
16
0
31 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
29
28
0
29 Oct 2023
IndustReal: A Dataset for Procedure Step Recognition Handling Execution Errors in Egocentric Videos in an Industrial-Like Setting
Tim J. Schoonbeek
Tim Houben
H. Onvlee
Peter H. N. de With
Fons van der Sommen
39
22
0
26 Oct 2023
Perceptual MAE for Image Manipulation Localization: A High-level Vision Learner Focusing on Low-level Features
Xiaochen Ma
Jizhe Zhou
Xiong Xu
Zhuohang Jiang
Chi-Man Pun
18
0
0
10 Oct 2023
Low-Resolution Self-Attention for Semantic Segmentation
Yu-Huan Wu
Shi-Chen Zhang
Yun-Hai Liu
Le Zhang
Xin Zhan
Daquan Zhou
Jiashi Feng
Ming-Ming Cheng
Liangli Zhen
ViT
32
3
0
08 Oct 2023
Prompt-to-OS (P2OS): Revolutionizing Operating Systems and Human-Computer Interaction with Integrated AI Generative Models
Gabriele Tolomei
Cesare Campagnano
Fabrizio Silvestri
Giovanni Trappolini
11
4
0
07 Oct 2023
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
Xinhao Li
Yuhan Zhu
Limin Wang
VLM
27
8
0
02 Oct 2023
Win-Win: Training High-Resolution Vision Transformers from Two Windows
Vincent Leroy
Jérôme Revaud
Thomas Lucas
Philippe Weinzaepfel
ViT
27
2
0
01 Oct 2023
A Survey on Deep Learning Techniques for Action Anticipation
Zeyun Zhong
Manuel Martin
Michael Voit
Juergen Gall
Jürgen Beyerer
19
7
0
29 Sep 2023
Training a Large Video Model on a Single Machine in a Day
Yue Zhao
Philipp Krähenbühl
VLM
25
15
0
28 Sep 2023
CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs
Ao Wang
Hui Chen
Zijia Lin
Sicheng Zhao
J. Han
Guiguang Ding
ViT
21
6
0
27 Sep 2023
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding
Mohamed Afham
Satya Narayan Shukla
Omid Poursaeed
Pengchuan Zhang
Ashish Shah
Sernam Lim
VLM
16
2
0
20 Sep 2023
RMT: Retentive Networks Meet Vision Transformers
Qihang Fan
Huaibo Huang
Mingrui Chen
Hongmin Liu
Ran He
ViT
30
73
0
20 Sep 2023
Selective Volume Mixup for Video Action Recognition
Yi Tan
Zhaofan Qiu
Y. Hao
Ting Yao
Xiangnan He
Tao Mei
ViT
28
2
0
18 Sep 2023
MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer
Fudong Lin
Summer Crawford
Kaleb Guillot
Yihe Zhang
Yan Chen
...
Tri Setiyono
B. Tubana
Lu Peng
Magdy A. Bayoumi
N. Tzeng
42
20
0
16 Sep 2023
DeepCompass: AI-driven Location-Orientation Synchronization for Navigating Platforms
Jihun Lee
SP Choi
Bumsoo Kang
Hyekyoung Seok
Hyoungseok Ahn
Sanghee Jung
10
0
0
15 Sep 2023
Empowering Visually Impaired Individuals: A Novel Use of Apple Live Photos and Android Motion Photos
Seyedalireza Khoshsirat
Chandra Kambhamettu
15
9
0
14 Sep 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
16
18
0
14 Sep 2023
Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion
Peiran Xu
Yadong Mu
21
7
0
14 Sep 2023
A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking
Lorenzo Papa
Paolo Russo
Irene Amerini
Luping Zhou
12
39
0
05 Sep 2023
RADIO: Reference-Agnostic Dubbing Video Synthesis
Dongyeun Lee
Chaewon Kim
Sangjoon Yu
Jaejun Yoo
Gyeong-Moon Park
VGen
DiffM
13
1
0
05 Sep 2023
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
Zhuofan Xia
Xuran Pan
Shiji Song
Li Erran Li
Gao Huang
ViT
19
22
0
04 Sep 2023
Self-Supervised Video Transformers for Isolated Sign Language Recognition
Marcelo Sandoval-Castaneda
Yanhong Li
D. Brentari
Karen Livescu
Gregory Shakhnarovich
SLR
8
2
0
02 Sep 2023
Document Layout Analysis on BaDLAD Dataset: A Comprehensive MViTv2 Based Approach
Ashrafur Rahman Khan
Asif Azad
17
0
0
31 Aug 2023
Motion-Guided Masking for Spatiotemporal Representation Learning
D. Fan
Jue Wang
Shuai Liao
Yi Zhu
Vimal Bhat
H. Santos-Villalobos
M. Rohith
Xinyu Li
VGen
18
18
0
24 Aug 2023
MOFO: MOtion FOcused Self-Supervision for Video Understanding
Mona Ahmadian
Frank Guerin
Andrew Gilbert
18
2
0
23 Aug 2023
Vision Transformer Adapters for Generalizable Multitask Learning
Deblina Bhattacharjee
Sabine Süsstrunk
Mathieu Salzmann
ViT
11
8
0
23 Aug 2023
Towards Privacy-Supporting Fall Detection via Deep Unsupervised RGB2Depth Adaptation
Hejun Xiao
Kunyu Peng
Xiangsheng Huang
Alina Roitberg
Hao Li
Zhao Wang
Rainer Stiefelhagen
11
3
0
23 Aug 2023
How Much Temporal Long-Term Context is Needed for Action Segmentation?
Emad Bahrami Rad
Gianpiero Francesca
Juergen Gall
ViT
8
24
0
22 Aug 2023
TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection
Joe Fioresi
I. Dave
M. Shah
34
10
0
21 Aug 2023