Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training

Main: 6 pages
9 figures
Bibliography: 2 pages
1 table
Abstract

Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
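The abstract does not spell out how label delay training is implemented. As a minimal sketch of one common reading, the frame-level training targets can be shifted later in time by a fixed number of frames, so that the model sees a little more audio before it has to commit to an endpoint decision. The function name `delay_labels`, the 1/0 label convention (1 = user still speaking, 0 = turn ended), and the delay value are assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def delay_labels(labels: List[int], delay_frames: int) -> List[int]:
    """Shift frame-level targets later by `delay_frames` frames.

    Hypothetical convention: 1 = user still speaking, 0 = turn ended.
    The target at frame t becomes the original label at frame t - delay_frames.
    The first frames are padded with the first label so the length is unchanged.
    """
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    if delay_frames == 0 or not labels:
        return list(labels)
    # Never pad with more frames than the sequence contains.
    pad = [labels[0]] * min(delay_frames, len(labels))
    # Drop the trailing frames that the shift pushes past the end.
    return pad + list(labels[: len(labels) - len(pad)])


# Toy sequence: speech for three frames, then the turn ends.
original = [1, 1, 1, 0, 0, 0]
delayed = delay_labels(original, delay_frames=2)
```

With a delay of two frames, the transition to the "turn ended" label moves two frames later in the toy sequence. Under this reading, the delay trades a small amount of added latency for extra context before the endpoint is declared, which is consistent with the abstract's goal of reducing cutoff errors at a fixed median latency.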

@article{udupa2025_2506.07081,
  title={Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training},
  author={Sathvik Udupa and Shinji Watanabe and Petr Schwarz and Jan Cernocky},
  journal={arXiv preprint arXiv:2506.07081},
  year={2025}
}