ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2203.12602
  4. Cited By
VideoMAE: Masked Autoencoders are Data-Efficient Learners for
  Self-Supervised Video Pre-Training

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

23 March 2022
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
    ViT
ArXivPDFHTML

Papers citing "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"

50 / 142 papers shown
Title
Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment
Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment
João Alves
Pia Haubro Andersen
Rikke Gade
28
0
0
06 May 2025
Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges
Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges
Hao Xu
Arbind Agrahari Baniya
Sam Well
Mohamed Reda Bouadjenek
Richard Dazeley
S. Aryal
AI4TS
22
0
0
06 May 2025
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Hao Cheng
Zhiwei Zhao
Yichao He
Zhenzhen Hu
Jia Li
M. Wang
Richang Hong
36
0
0
05 May 2025
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving
Zhijie Qiao
Haowei Li
Zhong Cao
Henry X. Liu
VLM
76
2
0
01 May 2025
Video CLIP Model for Multi-View Echocardiography Interpretation
Video CLIP Model for Multi-View Echocardiography Interpretation
Ryo Takizawa
Satoshi Kodera
Tempei Kabayama
Ryo Matsuoka
Yuta Ando
Yuto Nakamura
Haruki Settai
Norihiko Takeda
27
0
0
26 Apr 2025
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Yi-Xing Peng
Q. Yang
Yu-Ming Tang
Shenghao Fu
Kun-Yu Lin
Xihan Wei
Wei-Shi Zheng
40
0
0
25 Apr 2025
Latent Video Dataset Distillation
Latent Video Dataset Distillation
Ning Li
Antai Andy Liu
Jingran Zhang
Justin Cui
DD
VGen
65
0
0
23 Apr 2025
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Alexander Brettmann
Jakob Grävinghoff
Marlene Rüschoff
Marie Westhues
SLR
46
0
0
10 Apr 2025
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Artem Zholus
Carl Doersch
Yi Yang
Skanda Koppula
Viorica Patraucean
Xu He
Ignacio Rocco
Mehdi S. M. Sajjadi
Sarath Chandar
Ross Goroshin
28
0
0
08 Apr 2025
Post-processing for Fair Regression via Explainable SVD
Post-processing for Fair Regression via Explainable SVD
Zhiqun Zuo
Ding Zhu
Mohammad Mahdi Khalili
60
0
0
04 Apr 2025
CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
Yifan Wu
Zhiyang Dou
Yuko Ishiwaka
Shun Ogawa
Yuke Lou
Wenping Wang
Lingjie Liu
Taku Komura
40
3
0
31 Mar 2025
An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy
An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy
Haotian Yang
Z. Wang
Benson Chou
Sophie Xu
Hao Wang
Jingxian Wang
Qizhen Zhang
FedML
85
0
0
26 Mar 2025
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos
Jiaheng Zhou
Yanfeng Zhou
Wei Fang
Yuxing Tang
Le Lu
Ge Yang
Mamba
121
0
0
26 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zheng Liu
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
86
0
0
24 Mar 2025
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
Haozhe Si
Yuxuan Wan
Minh Do
Deepak Vasisht
Han Zhao
Hendrik Hamann
38
0
0
17 Mar 2025
A Large-Scale Study on Video Action Dataset Condensation
A Large-Scale Study on Video Action Dataset Condensation
Yang Chen
Sheng Guo
Bo Zheng
Limin Wang
DD
77
2
0
13 Mar 2025
AudioX: Diffusion Transformer for Anything-to-Audio Generation
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Y. Guo
65
3
0
13 Mar 2025
End-to-End Action Segmentation Transformer
Tieqiao Wang
Sinisa Todorovic
ViT
37
0
0
08 Mar 2025
Learning to Animate Images from A Few Videos to Portray Delicate Human Actions
Haoxin Li
Yingchen Yu
Qilong Wu
Hanwang Zhang
Boyang Li
Song Bai
3DH
VGen
59
0
0
01 Mar 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu
Huan Zhang
AAML
72
0
0
25 Feb 2025
Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity
Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity
Yizhuo Lu
Changde Du
Chong Wang
Xuanliu Zhu
Liuyun Jiang
Xujin Li
Huiguang He
VGen
102
4
0
20 Feb 2025
L4P: Low-Level 4D Vision Perception Unified
L4P: Low-Level 4D Vision Perception Unified
Abhishek Badki
Hang Su
Bowen Wen
Orazio Gallo
VLM
75
1
0
18 Feb 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Ruoxuan Feng
Jiangyu Hu
Wenke Xia
Tianci Gao
Ao Shen
Yuhao Sun
Bin Fang
Di Hu
42
2
0
15 Feb 2025
Learning Human Skill Generators at Key-Step Levels
Learning Human Skill Generators at Key-Step Levels
Yilu Wu
Chenhui Zhu
Shuai Wang
Hanlin Wang
Jing Wang
Zhaoxiang Zhang
Limin Wang
VGen
112
0
0
12 Feb 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
93
3
0
12 Feb 2025
Data-efficient Performance Modeling via Pre-training
Data-efficient Performance Modeling via Pre-training
Chunting Liu
Riyadh Baghdadi
37
0
0
24 Jan 2025
Slot-BERT: Self-supervised Object Discovery in Surgical Video
Slot-BERT: Self-supervised Object Discovery in Surgical Video
Guiqiu Liao
M. Jogan
Marcel Hussing
Kenta Nakahashi
Kazuhiro Yasufuku
Amin Madani
Eric Eaton
Daniel A. Hashimoto
43
0
0
21 Jan 2025
MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance
MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance
Jialong Guo
Ke Liu
Jiangchao Yao
Zhihua Wang
Jiajun Bu
Haishuai Wang
AI4TS
40
0
0
20 Jan 2025
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
R. Yasarla
Manish Kumar Singh
Hong Cai
Yunxiao Shi
Jisoo Jeong
Yinhao Zhu
Shizhong Han
Risheek Garrepalli
Fatih Porikli
MDE
80
5
0
17 Jan 2025
RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation
RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation
Zixuan Chen
Jing Huo
Yangtao Chen
Yang Gao
43
2
0
11 Jan 2025
Multi-subject Open-set Personalization in Video Generation
Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Yuwei Fang
Kwot Sin Lee
Ivan Skorokhodov
Kfir Aberman
Jun-Yan Zhu
Ming Yang
Sergey Tulyakov
DiffM
VGen
69
7
0
10 Jan 2025
MVP: Multimodal Emotion Recognition based on Video and Physiological Signals
Valeriya Strizhkova
Hadi Kachmar
Hava Chaptoukaev
Raphael Kalandadze
Natia Kukhilava
...
Maria A. Zuluaga
Michal Balazia
A. Dantcheva
François Brémond
Laura M. Ferrari
30
0
0
06 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
56
23
0
31 Dec 2024
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
Yilong Wang
Zilin Gao
Qilong Wang
Zhaofeng Chen
P. Li
Q. Hu
72
1
0
28 Nov 2024
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
Gyeongjin Kang
Jisang Yoo
Jihyeon Park
Seungtae Nam
Hyeonsoo Im
Sangheon Shin
Sangpil Kim
Eunbyung Park
3DGS
98
3
0
26 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
Sipeng Zheng
Zongqing Lu
91
1
0
25 Nov 2024
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Jiange Yang
Haoyi Zhu
Y. Wang
Gangshan Wu
Tong He
Limin Wang
86
2
0
21 Nov 2024
Principles of Visual Tokens for Efficient Video Understanding
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
73
0
0
20 Nov 2024
Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Bing Cao
Quanhao Lu
Jiekang Feng
Pengfei Zhu
Q. Hu
Qilong Wang
64
0
0
20 Nov 2024
Efficient Transfer Learning for Video-language Foundation Models
Haoxing Chen
Zizheng Huang
Y. Hong
Yanshuo Wang
Zhongcai Lyu
Zhuoer Xu
Jun Lan
Zhangxuan Gu
VLM
41
0
0
18 Nov 2024
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Vipula Rawte
Sarthak Jain
Aarush Sinha
Garv Kaushik
Aman Bansal
...
Aishwarya N. Reganti
Vinija Jain
Aman Chadha
A. Sheth
A. Das
VLM
MLLM
46
1
0
16 Nov 2024
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
  Image-to-Video Generation
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Wenhao Wang
Y. Yang
VGen
40
3
0
05 Nov 2024
EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
Sekeun Kim
Pengfei Jin
S. Song
Cheng Chen
Yiwei Li
Hui Ren
Xiang Li
Tianming Liu
Quanzheng Li
26
0
0
30 Oct 2024
Revisiting MAE pre-training for 3D medical image segmentation
Revisiting MAE pre-training for 3D medical image segmentation
Tassilo Wald
Constantin Ulrich
Stanislav Lukyanenko
Andrei Goncharov
Alberto Paderno
Leander Maerkisch
Paul F. Jäger
Paul F. Jäger
Klaus Maier-Hein
30
2
0
30 Oct 2024
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
Haoyi Zhu
Honghui Yang
Yating Wang
Jiange Yang
Limin Wang
Tong He
3DH
43
5
0
10 Oct 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-xiong Wang
70
15
0
05 Sep 2024
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with
  Multimodal Large Language Models
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
Y. Guo
Faizan Siddiqui
Yang Zhao
Rama Chellappa
Shao-Yuan Lo
LRM
24
2
0
31 Aug 2024
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs
Eui Jun Hwang
Sukmin Cho
Junmyeong Lee
Jong C. Park
SLR
59
4
0
20 Aug 2024
PooDLe: Pooled and dense self-supervised learning from naturalistic videos
PooDLe: Pooled and dense self-supervised learning from naturalistic videos
Alex N. Wang
Christopher Hoang
Yuwen Xiong
Yann LeCun
Mengye Ren
64
0
0
20 Aug 2024
Membership Inference Attack Against Masked Image Modeling
Membership Inference Attack Against Masked Image Modeling
Z. Li
Xinlei He
Ning Yu
Yang Zhang
35
1
0
13 Aug 2024
123
Next