Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

14 October 2024
Peiwen Sun
Sitong Cheng
Xiangtai Li
Zhen Ye
Huadai Liu
Honggang Zhang
Wei Xue
Yike Guo
Abstract

Recently, diffusion models have achieved great success in mono-channel audio generation. Stereo audio generation, however, must handle soundscapes that form complex scenes with multiple objects and directions, and controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct BEWO-1M, a large-scale, simulation-based, and GPT-assisted dataset with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, we also pair images with stereo audio through retrieval to advance multimodal generation. Existing audio generation models tend to produce rather random and indistinct spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to derive reasonable spatial guidance. By leveraging this guidance, our model not only generates immersive and controllable spatial audio from text but also extends, as a pioneering attempt, to other modalities. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method and its capability to generate spatial audio that adheres to physical rules.
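
To make the idea of an "azimuth state matrix" concrete, below is a minimal sketch of how per-frame azimuth states for a moving source could be encoded and fused with text embeddings into a guidance sequence for a latent diffusion model. All names (azimuth_state_matrix, SpatialAwareEncoder), shapes, the bin count, and the linear-interpolation scheme are assumptions for illustration; the paper's actual SpatialSonic implementation is not described in this excerpt.

```python
# Hypothetical sketch of azimuth-conditioned guidance, loosely following the
# abstract's description. Everything here is an assumption, not the paper's code.
import torch
import torch.nn as nn

def azimuth_state_matrix(start_deg, end_deg, num_frames, num_bins=36):
    """Encode a (possibly moving) source's azimuth as a one-hot state per frame.

    Linearly interpolates the azimuth from start to end and quantizes it into
    `num_bins` directional bins, yielding a (num_frames, num_bins) matrix.
    """
    azimuths = torch.linspace(start_deg, end_deg, num_frames)
    bins = ((azimuths % 360) / (360 / num_bins)).long()
    state = torch.zeros(num_frames, num_bins)
    state[torch.arange(num_frames), bins] = 1.0
    return state

class SpatialAwareEncoder(nn.Module):
    """Fuses text embeddings with an azimuth state matrix into one guidance
    sequence that a diffusion model's cross-attention could attend to."""

    def __init__(self, text_dim=768, num_bins=36, hidden_dim=768):
        super().__init__()
        self.azimuth_proj = nn.Linear(num_bins, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_emb, azimuth_state):
        # text_emb: (batch, seq_len, text_dim); azimuth_state: (batch, frames, bins)
        tokens = torch.cat(
            [self.text_proj(text_emb), self.azimuth_proj(azimuth_state)], dim=1
        )
        return self.fuse(tokens)  # (batch, seq_len + frames, hidden_dim)

# Example: a source sweeping from the left (-60 deg) to the right (+60 deg).
state = azimuth_state_matrix(-60, 60, num_frames=100).unsqueeze(0)
text = torch.randn(1, 32, 768)  # stand-in for a frozen text encoder's output
guidance = SpatialAwareEncoder()(text, state)
print(guidance.shape)  # torch.Size([1, 132, 768])
```

One-hot bins over time are just one plausible way to give the denoiser an explicit, frame-aligned directional signal; continuous angle embeddings or sine/cosine encodings would serve the same conditioning role.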

@article{sun2024_2410.10676,
  title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
  author={Peiwen Sun and Sitong Cheng and Xiangtai Li and Zhen Ye and Huadai Liu and Honggang Zhang and Wei Xue and Yike Guo},
  journal={arXiv preprint arXiv:2410.10676},
  year={2024}
}