ResearchTrend.AI
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

2 December 2021
Yanghao Li
Chao-Yuan Wu
Haoqi Fan
K. Mangalam
Bo Xiong
Jitendra Malik
Christoph Feichtenhofer
    ViT

Papers citing "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

50 / 395 papers shown
Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
Wei Li
Ming Hu
Guoan Wang
Lihao Liu
Kaijin Zhou
Junzhi Ning
Xin Guo
Zongyuan Ge
Lixu Gu
Junjun He
18
0
0
12 May 2025
Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Mishal Fatima
Steffen Jung
M. Keuper
28
0
0
06 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
76
0
0
28 Apr 2025
Hierarchical and Multimodal Data for Daily Activity Understanding
Ghazal Kaviani
Yavuz Yarici
Seulgi Kim
M. Prabhushankar
Ghassan AlRegib
Mashhour Solh
Ameya Patil
49
0
0
24 Apr 2025
A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw
Wenwen Li
Chia-Yu Hsu
Sizhe Wang
Zhining Gu
Yili Yang
Brendan M. Rogers
A. Liljedahl
50
0
0
23 Apr 2025
Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis
Zhu Zhu
Shuo Jiang
Jingyuan Zheng
Yawen Li
Yifei Chen
Manli Zhao
Weizhong Gu
Feiwei Qin
Jinhu Wang
Gang Yu
MedIm
33
0
0
18 Apr 2025
Action Anticipation from SoccerNet Football Video Broadcasts
Mohamad Dalal
Artur Xarles
A. Cioppa
Silvio Giancola
Marc Van Droogenbroeck
Bernard Ghanem
Albert Clapés
Sergio Escalera
T. Moeslund
AI4TS
26
0
0
16 Apr 2025
Exploring Video-Based Driver Activity Recognition under Noisy Labels
Linjuan Fan
Di Wen
Kunyu Peng
Kailun Yang
J. Zhang
...
Yufan Chen
Junwei Zheng
Jiamin Wu
Xudong Han
Rainer Stiefelhagen
NoLa
47
0
0
16 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
63
6
1
14 Apr 2025
DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction
Kiana Hoshanfar
Alireza Hosseini
Ahmad Kalhor
Babak Nadjar Araabi
44
0
0
14 Apr 2025
Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning
Kyle Stein
A. Mahyari
Guillermo Francia III
Eman El-Sheikh
CLL
58
0
0
11 Apr 2025
Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset
Shiao Wang
X. Wang
Bo Jiang
Lin Zhu
G. Li
Y. Wang
Yonghong Tian
Jin Tang
53
0
0
08 Apr 2025
Towards Generalizing Temporal Action Segmentation to Unseen Views
Emad Bahrami
Olga Zatsarynna
Gianpiero Francesca
Juergen Gall
EgoV
38
0
0
03 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
49
0
0
03 Apr 2025
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Bernard Ghanem
50
0
0
01 Apr 2025
Multi-Task Learning for Extracting Menstrual Characteristics from Clinical Notes
Anna Shopova
Cristoph Lippert
Leslee J. Shaw
Eugenia Alleva
37
0
0
31 Mar 2025
OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition
Shihao Cheng
Jinlu Zhang
Yue Liu
Zhigang Tu
VLM
37
0
0
30 Mar 2025
Efficient Token Compression for Vision Transformer with Spatial Information Preserved
Junzhu Mao
Yang Shen
Jinyang Guo
Yazhou Yao
Xiansheng Hua
ViT
31
0
0
30 Mar 2025
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
Jonathan Attard
Dylan Seychell
46
0
0
27 Mar 2025
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Jiaheng Zhou
Yanfeng Zhou
Wei Fang
Yuxing Tang
Le Lu
Ge Yang
Mamba
119
0
0
26 Mar 2025
Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings
Chengan Che
Chao Wang
Tom Vercauteren
Sophia Tsoka
Luis C. García-Peraza-Herrera
MedIm
36
0
0
25 Mar 2025
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Wencheng Zhu
Yuexin Wang
Hongxuan Li
Pengfei Zhu
Q. Hu
CLIP
48
0
0
24 Mar 2025
Beyond Accuracy: What Matters in Designing Well-Behaved Models?
Robin Hesse
Doğukan Bağcı
Bernt Schiele
Simone Schaub-Meyer
Stefan Roth
VLM
54
0
0
21 Mar 2025
Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
Chi Hsuan Wu
Kumar Ashutosh
Kristen Grauman
DiffM
58
0
0
18 Mar 2025
Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition
Shristi Das Biswas
Efstathia Soufleri
Arani Roy
Kaushik Roy
49
0
0
17 Mar 2025
Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
Haoxuan Li
Sixu Yan
Y. Li
Xinggang Wang
LM&Ro
59
0
0
13 Mar 2025
PromptGAR: Flexible Promptive Group Activity Recognition
Zhangyu Jin
Andrew Feng
Ankur Chemburkar
Celso M. De Melo
VLM
39
0
0
11 Mar 2025
Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
Seokun Kang
Taehwan Kim
37
0
0
04 Mar 2025
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
Antoni Bigata
Michał Stypułkowski
Rodrigo Mira
Stella Bounareli
Konstantinos Vougioukas
Zoe Landgraf
Nikita Drobyshev
Maciej Ziȩba
Stavros Petridis
M. Pantic
DiffM
VGen
61
2
0
03 Mar 2025
An Efficient Approach to Detecting Lung Nodules Using Swin Transformer
Saeed Shakuri
Alireza Rezvanian
ViT
MedIm
34
1
0
03 Mar 2025
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
Otto Brookes
Maksim Kukushkin
Majid Mirmehdi
Colleen Stephens
Paula Dieguez
...
Lukas Boesch
Thomas Schmid
M. Arandjelovic
H. Kühl
T. Burghardt
46
0
0
28 Feb 2025
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
Shuming Liu
Chen Zhao
Fatimah Zohra
Mattia Soldan
Alejandro Pardo
...
Juan Carlos León Alcázar
A. Cioppa
Silvio Giancola
Carlos Hinojosa
Bernard Ghanem
55
3
0
27 Feb 2025
Hierarchical Context Transformer for Multi-level Semantic Scene Understanding
Luoying Hao
Yan Hu
Yang Yue
Li Wu
Huazhu Fu
Jinming Duan
Jiang Liu
57
0
0
24 Feb 2025
iFormer: Integrating ConvNet and Transformer for Mobile Application
Chuanyang Zheng
ViT
65
0
0
26 Jan 2025
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
Mamba
46
3
0
10 Jan 2025
Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition
Mallika Garg
Debashis Ghosh
P. M. Pradhan
SLR
30
15
0
03 Jan 2025
Breaking the Context Bottleneck on Long Time Series Forecasting
Chao Ma
Yikai Hou
Xiang Li
Yinggang Sun
Haining Yu
Zhou Fang
Jiaxing Qu
AI4TS
65
0
0
21 Dec 2024
ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
Seungdong Yoa
Seungjun Lee
Hyeseung Cho
Bumsoo Kim
Woohyung Lim
ViT
67
0
0
21 Dec 2024
Bridging the Divide: Reconsidering Softmax and Linear Attention
Dongchen Han
Yifan Pu
Zhuofan Xia
Yizeng Han
Xuran Pan
Xiu Li
Jiwen Lu
Shiji Song
Gao Huang
61
2
0
09 Dec 2024
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
X. Zhang
Zecheng Tang
Zhipei Xu
Runyi Li
Youmin Xu
Bin Chen
Feng Gao
Jian Andrew Zhang
WIGM
93
4
0
02 Dec 2024
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
Guanyu Zhou
Wenxuan Liu
Wenxin Huang
Xuemei Jia
X. Zhong
Chia-Wen Lin
CML
69
0
0
24 Nov 2024
Learning Collective Dynamics of Multi-Agent Systems using Event-based Vision
Minah Lee
Uday Kamal
Saibal Mukhopadhyay
18
0
0
11 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
26
9
0
07 Nov 2024
AM Flow: Adapters for Temporal Processing in Action Recognition
Tanay Agrawal
Abid Ali
A. Dantcheva
François Brémond
21
0
0
04 Nov 2024
HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
Zirui Wang
Xinran Zhao
Simon Stepputtis
Woojun Kim
Tongshuang Wu
Katia P. Sycara
Yaqi Xie
OffRL
39
0
0
03 Nov 2024
Video Token Merging for Long-form Video Understanding
Seon-Ho Lee
Jue Wang
Zhikang Zhang
D. Fan
Xinyu Li
33
5
0
31 Oct 2024
Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context
Manuel Benavent-Lledo
David Mulero-Pérez
David Ortiz-Perez
José García Rodríguez
Antonis Argyros
24
0
0
28 Oct 2024
Multi-Class Abnormality Classification Task in Video Capsule Endoscopy
Dev Rishi Verma
Vibhor Saxena
Dhruv Sharma
Arpan Gupta
19
1
0
25 Oct 2024
On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
Rajat Modi
Vibhav Vineet
Y. S. Rawat
31
1
0
25 Oct 2024
Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques
David Ortiz-Perez
Manuel Benavent-Lledo
José García Rodríguez
David Tomás
M. Flores Vizcaya-Moreno
18
0
0
24 Oct 2024