Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training

8 June 2025

Sathvik Udupa

Shinji Watanabe

Petr Schwarz

Jan "Honza" Cernocký


Main:6 Pages

9 Figures

Bibliography:2 Pages

1 Table

Abstract

Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
