Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
Interactions through Masked Modeling

2 December 2023

Shentong Mo

Papers citing "Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling"

Title | Authors | Tags | Counts | Date
OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models | Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann | VOS, VLM | 37 0 0 | 30 Apr 2025
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap | Shentong Mo, Zehua Chen, Fan Bao, Jun-Jie Zhu | DiffM | 50 0 0 | 15 Mar 2025
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling | Jinhong Lin, Cheng-En Wu, Huanran Li, Jifan Zhang, Yu Hen Hu, Pedro Morgado | | 23 0 0 | 16 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow | Shentong Mo, Yibing Song | | 21 0 0 | 30 Oct 2024
MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection | Niki Nezakati, Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif | | 27 1 0 | 03 Oct 2024
Multi-scale Multi-instance Visual Sound Localization and Segmentation | Shentong Mo, Haofan Wang | | 25 2 0 | 31 Aug 2024
Audio-visual Generalized Zero-shot Learning the Easy Way | Shentong Mo, Pedro Morgado | | 22 1 0 | 18 Jul 2024
Semantic Grouping Network for Audio Source Separation | Shentong Mo, Yapeng Tian | | 28 4 0 | 04 Jul 2024
Unified Video-Language Pre-training with Synchronized Audio | Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang | | 27 1 0 | 12 May 2024
Text-to-Audio Generation Synchronized with Videos | Shentong Mo, Jing Shi, Yapeng Tian | DiffM, VGen | 28 17 0 | 08 Mar 2024
Weakly-Supervised Audio-Visual Segmentation | Shentong Mo, Bhiksha Raj | VOS | 28 12 0 | 25 Nov 2023
AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation | Shentong Mo, Yapeng Tian | VLM | 79 47 0 | 03 May 2023
Audio-Visual Segmentation with Semantics | Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, ..., Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong | VOS | 35 37 0 | 30 Jan 2023
A Closer Look at Weakly-Supervised Audio-Visual Source Localization | Shentong Mo, Pedro Morgado | | 69 64 0 | 30 Aug 2022
Masked Autoencoders Are Scalable Vision Learners | Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross B. Girshick | ViT, TPM | 258 7,337 0 | 11 Nov 2021