ResearchTrend.AI

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

17 February 2025
Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He
Abstract

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, Step-Audio shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at this https URL.
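The instruction-driven fine control described in contribution 3 amounts to conditioning generation on a natural-language control instruction prepended to the text to synthesize. The sketch below is a hypothetical illustration of how such a request could be assembled; the `build_prompt` helper and its field names are assumptions for exposition, not the actual Step-Audio API.

```python
from typing import Optional

def build_prompt(text: str, emotion: str = "neutral",
                 dialect: Optional[str] = None,
                 style: Optional[str] = None) -> str:
    """Prefix the text to synthesize with a natural-language control
    instruction, the mechanism the paper describes for dynamic
    adjustment of dialects, emotions, singing, and RAP."""
    controls = [f"emotion={emotion}"]
    if dialect:
        controls.append(f"dialect={dialect}")
    if style:
        controls.append(f"style={style}")
    instruction = "Speak with " + ", ".join(controls) + "."
    # The model sees the instruction and the target text in one prompt.
    return f"{instruction}\n{text}"

print(build_prompt("Hello, world!", emotion="happy", style="RAP"))
```

A request for a Sichuan-dialect reading would swap in `dialect="Sichuan"`; the point is that control is expressed in the prompt rather than through separate synthesis parameters.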

@article{huang2025_2502.11946,
  title={Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction},
  author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and Peng Liu and Ruihang Miao and Wang You and Xi Chen and Xuerui Yang and Yechang Huang and Yuxiang Zhang and Zheng Gong and Zixin Zhang and Hongyu Zhou and Jianjian Sun and Brian Li and Chengting Feng and Changyi Wan and Hanpeng Hu and Jianchang Wu and Jiangjie Zhen and Ranchen Ming and Song Yuan and Xuelin Zhang and Yu Zhou and Bingxin Li and Buyun Ma and Hongyuan Wang and Kang An and Wei Ji and Wen Li and Xuan Wen and Xiangwen Kong and Yuankai Ma and Yuanwei Liang and Yun Mou and Bahtiyar Ahmidi and Bin Wang and Bo Li and Changxin Miao and Chen Xu and Chenrun Wang and Dapeng Shi and Deshan Sun and Dingyuan Hu and Dula Sai and Enle Liu and Guanzhe Huang and Gulin Yan and Heng Wang and Haonan Jia and Haoyang Zhang and Jiahao Gong and Junjing Guo and Jiashuai Liu and Jiahong Liu and Jie Feng and Jie Wu and Jiaoren Wu and Jie Yang and Jinguo Wang and Jingyang Zhang and Junzhe Lin and Kaixiang Li and Lei Xia and Li Zhou and Liang Zhao and Longlong Gu and Mei Chen and Menglin Wu and Ming Li and Mingxiao Li and Mingliang Li and Mingyao Liang and Na Wang and Nie Hao and Qiling Wu and Qinyuan Tan and Ran Sun and Shuai Shuai and Shaoliang Pang and Shiliang Yang and Shuli Gao and Shanshan Yuan and Siqi Liu and Shihong Deng and Shilei Jiang and Sitong Liu and Tiancheng Cao and Tianyu Wang and Wenjin Deng and Wuxun Xie and Weipeng Ming and Wenqing He},
  journal={arXiv preprint arXiv:2502.11946},
  year={2025}
}