A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication

9 April 2025

Abstract

This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, targeting low-latency inference and efficient real-time generation. To improve codebook utilization and to compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ connects scalar quantizers and improved vector quantizers sequentially in a residual manner: the scalar quantizers construct coarse audio contours, and the vector quantizers refine acoustic details. Experimental results show that StreamCodec achieves decoded audio quality comparable to that of advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms, generates audio nearly 20 times faster than real time on a CPU, and has just 7M parameters, making it well suited to real-time communication applications.
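To illustrate the residual scalar-then-vector idea described above, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the RSVQ internals, so the function names (`scalar_quantize`, `vector_quantize`, `rsvq`), the tanh bounding, the number of scalar levels, and the plain nearest-neighbour vector quantizer (standing in for the paper's "improved" one) are all assumptions made for illustration.

```python
import numpy as np

def scalar_quantize(x, levels=5):
    """Hypothetical scalar stage: squash each value to (-1, 1) with tanh,
    then round it to one of `levels` uniformly spaced values."""
    half = (levels - 1) / 2.0
    return np.round(np.tanh(x) * half) / half

def vector_quantize(x, codebook):
    """Plain nearest-neighbour VQ: replace each row of x (frames x dims)
    with its closest codebook row under squared Euclidean distance."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def rsvq(x, codebooks, levels=5):
    """Residual chain (assumed layout): one scalar stage captures a coarse
    approximation, then each VQ stage quantizes what is still left over."""
    coarse = scalar_quantize(x, levels)
    quantized = coarse
    residual = x - coarse
    indices = []
    for codebook in codebooks:
        q, idx = vector_quantize(residual, codebook)
        quantized = quantized + q   # accumulate the refined reconstruction
        residual = residual - q     # pass the remaining error to the next stage
        indices.append(idx)
    return quantized, indices
```

In this sketch, each later stage only sees the error left by the earlier ones, which is what "in a residual manner" means here; a trained codec would learn the codebooks rather than take them as given.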


@article{jiang2025_2504.06561,
  title={A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication},
  author={Xiao-Hang Jiang and Yang Ai and Rui-Chen Zheng and Zhen-Hua Ling},
  journal={arXiv preprint arXiv:2504.06561},
  year={2025}
}
