MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

3 October 2024
Trung X. Pham, Tri Ton, Chang D. Yoo
Abstract

We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at this https URL.
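
The temporal-aware masking strategy is the abstract's most concrete technical idea. As a rough illustration only, the minimal PyTorch sketch below masks whole temporal frames of a tokenized audio latent rather than uniformly random tokens; the function name, shapes, and the 0.7 mask ratio are illustrative assumptions, not the paper's actual implementation.

import torch

def temporal_aware_mask(batch: int, num_frames: int, tokens_per_frame: int,
                        mask_ratio: float = 0.7) -> torch.Tensor:
    # Hypothetical sketch: mask entire temporal frames (columns of the
    # latent spectrogram) so the transformer reconstructs contiguous
    # spans of time rather than isolated tokens. Returns a boolean mask
    # of shape (batch, num_frames * tokens_per_frame); True = masked.
    num_masked = int(round(mask_ratio * num_frames))
    noise = torch.rand(batch, num_frames)          # random score per frame
    order = noise.argsort(dim=1)                   # random frame permutation
    frame_mask = torch.zeros(batch, num_frames, dtype=torch.bool)
    frame_mask.scatter_(1, order[:, :num_masked], True)
    # Broadcast the frame-level decision to every token in that frame.
    return frame_mask.repeat_interleave(tokens_per_frame, dim=1)

# Example: mask ~70% of 16 frames, 8 tokens per frame.
mask = temporal_aware_mask(batch=2, num_frames=16, tokens_per_frame=8)
print(mask.shape, mask.float().mean().item())  # torch.Size([2, 128]) 0.6875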

View on arXiv
@article{pham2025_2410.02130,
  title={MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation},
  author={Trung X. Pham and Tri Ton and Chang D. Yoo},
  journal={arXiv preprint arXiv:2410.02130},
  year={2025}
}