Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion

26 July 2025

Hei Shing Cheung

Boya Zhang

DiffM


Main: 5 Pages

3 Figures

Bibliography: 2 Pages

Abstract

We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.
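To make the idea of timestep-dependent blending of local and global attention concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the linear blend weight `alpha = t / num_steps`, the fixed `window` radius, and the function names `attention_weights`, `mixed_weights`, and `soft_alignment_attention` are all assumptions introduced here for illustration, and the abstract does not specify how the combination is actually computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k, window=None):
    """Scaled dot-product attention weights, one row per query position.

    If `window` is given, position i may only attend to positions j with
    |i - j| <= window (local attention); otherwise attention is global.
    """
    n, d = len(q), len(q[0])
    rows = []
    for i in range(n):
        scores = []
        for j in range(n):
            if window is not None and abs(i - j) > window:
                scores.append(float("-inf"))  # masked out of the local window
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
        rows.append(softmax(scores))
    return rows


def mixed_weights(q, k, t, num_steps, window=1):
    """Blend local and global attention weights by diffusion timestep.

    ASSUMPTION: a linear schedule alpha = t / num_steps, so that t = 0 is
    purely local and t = num_steps is purely global. The paper's actual
    schedule may differ.
    """
    alpha = t / num_steps
    local = attention_weights(q, k, window)
    glob = attention_weights(q, k)
    n = len(q)
    return [
        [(1.0 - alpha) * local[i][j] + alpha * glob[i][j] for j in range(n)]
        for i in range(n)
    ]


def soft_alignment_attention(q, k, v, t, num_steps, window=1):
    """Apply the blended attention weights to the value vectors."""
    w = mixed_weights(q, k, t, num_steps, window)
    n, dv = len(v), len(v[0])
    return [
        [sum(w[i][j] * v[j][c] for j in range(n)) for c in range(dv)]
        for i in range(n)
    ]
```

Because each of the two weight matrices is row-normalized and the blend is a convex combination, every row of the mixed weights still sums to one, so the output remains a weighted average of the value vectors at every timestep.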
