A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voices from a given music score (lyrics, duration and pitch). Recently, diffusion models have performed well in this field; however, they sacrifice inference speed in exchange for high-quality sample generation, which limits their application scenarios. To obtain high-quality synthetic singing voices more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline models in terms of generation speed and quality. Audio samples are available at this https URL.
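To make the consistency constraint concrete, the sketch below shows a generic consistency-training step in the style of consistency models (Song et al., 2023), applied to mel-spectrogram generation. The `model` call signature, noise schedule, feature shapes, and weighting are illustrative assumptions, not ConSinger's actual implementation: the network's output at a noisier point on the trajectory is pulled toward a target network's output at the adjacent, less-noisy point, which is what later allows samples to be produced in very few steps.

```python
# Minimal sketch of a consistency-constraint training step (PyTorch).
# NOTE: `model(x, sigma, cond)`, the noise schedule, and the mel-spectrogram
# shape (batch, frames, n_mels) are illustrative assumptions, not the
# paper's exact setup.
import torch
import torch.nn.functional as F


def scalings(sigma, sigma_data=0.5, eps=0.002):
    # Boundary-preserving scalings so that f_theta(x, eps) = x exactly.
    c_skip = sigma_data**2 / ((sigma - eps) ** 2 + sigma_data**2)
    c_out = sigma_data * (sigma - eps) / (sigma**2 + sigma_data**2).sqrt()
    return c_skip, c_out


def consistency_fn(model, x, sigma, cond):
    # f_theta(x, sigma) = c_skip * x + c_out * model(x, sigma, cond)
    c_skip, c_out = scalings(sigma)
    return c_skip[:, None, None] * x + c_out[:, None, None] * model(x, sigma, cond)


def consistency_loss(model, ema_model, mel, cond, sigmas):
    # Sample adjacent noise levels sigma_{i+1} > sigma_i from the schedule.
    i = torch.randint(0, len(sigmas) - 1, (mel.size(0),), device=mel.device)
    s_lo, s_hi = sigmas[i], sigmas[i + 1]
    noise = torch.randn_like(mel)
    x_hi = mel + s_hi[:, None, None] * noise        # noisier point on the trajectory
    x_lo = mel + s_lo[:, None, None] * noise        # adjacent, less noisy point
    pred = consistency_fn(model, x_hi, s_hi, cond)  # online network
    with torch.no_grad():
        target = consistency_fn(ema_model, x_lo, s_lo, cond)  # target network
    # Consistency constraint: outputs along the same trajectory should match.
    return F.mse_loss(pred, target)
```

In practice the target network is typically an exponential moving average of the online network's weights, and a separate vocoder maps the predicted mel-spectrogram to a waveform; both are omitted here. At inference, a single evaluation of `consistency_fn` at the largest noise level (optionally refined with a few extra steps) maps noise and the music-score conditioning to a mel-spectrogram, which is what enables few-step generation.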