AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion

Main: 4 pages, 1 figure, 3 tables
Bibliography: 1 page
Abstract

Diffusion models have significantly improved the quality and diversity of audio generation but are hindered by slow inference speed. Rectified flow enhances inference speed by learning straight-line ordinary differential equation (ODE) paths. However, this approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. To address the limitations of rectified flow while leveraging the advantages of advanced pre-trained diffusion models, this study integrates pre-trained models with the rectified diffusion method to improve the efficiency of text-to-audio (TTA) generation. Specifically, we propose AudioTurbo, which learns first-order ODE paths from deterministic noise-sample pairs generated by a pre-trained TTA model. Experiments on the AudioCaps dataset demonstrate that our model outperforms prior models with only 10 sampling steps and, compared with a flow-matching-based acceleration model, reduces inference to 3 steps.
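
To illustrate why straight-line ODE paths allow few-step sampling, here is a minimal sketch of the general rectified-flow idea mentioned in the abstract. It is not AudioTurbo's implementation: the linear interpolation, the toy 8-dimensional vectors, the function names (`interpolate`, `velocity_target`, `euler_sample`) and the oracle velocity function are all assumptions made for illustration. A real system would train a neural network on noise-sample pairs produced by the pre-trained TTA model.

```python
import numpy as np

def interpolate(noise, sample, t):
    """Point on the assumed straight path from noise (t=0) to sample (t=1)."""
    return (1.0 - t) * noise + t * sample

def velocity_target(noise, sample):
    """Regression target for a velocity model on a straight path.

    For a straight line the velocity is constant along the whole path.
    """
    return sample - noise

def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(noise, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy noise-sample pair (hypothetical stand-in for a pair produced by a
# pre-trained model's deterministic sampler).
rng = np.random.default_rng(0)
noise = rng.standard_normal(8)
sample = rng.standard_normal(8)

# Oracle velocity: what a perfectly trained model would predict on this pair.
def oracle(x, t):
    return velocity_target(noise, sample)

x3 = euler_sample(oracle, noise, num_steps=3)
x10 = euler_sample(oracle, noise, num_steps=10)
print(np.allclose(x3, sample), np.allclose(x10, sample))
```

Because the velocity is constant along a perfectly straight path, Euler integration reaches the endpoint regardless of the step count in this toy setting. With a learned model the paths are only approximately straight, which is why the number of sampling steps still affects quality in practice.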

@article{zhao2025_2505.22106,
  title={AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion},
  author={Junqi Zhao and Jinzheng Zhao and Haohe Liu and Yun Chen and Lu Han and Xubo Liu and Mark Plumbley and Wenwu Wang},
  journal={arXiv preprint arXiv:2505.22106},
  year={2025}
}