Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding

Main: 6 pages
1 figure
Bibliography: 1 page
Abstract

RNN-T-based keyword spotting (KWS) with autoregressive (AR) decoding has gained attention due to its streaming architecture and superior performance. However, the simplicity of the prediction network in RNN-T makes the model prone to overfitting, especially under challenging scenarios, resulting in degraded performance. In this paper, we propose a masked self-distillation (MSD) training strategy that keeps the RNN-T from relying too heavily on its prediction network, thereby alleviating overfitting. Such training enables masked non-autoregressive (NAR) decoding, which fully masks the RNN-T predictor output during KWS decoding. In addition, we propose a semi-autoregressive (SAR) decoding approach that integrates the advantages of AR and NAR decoding. Our experiments across multiple KWS datasets demonstrate that MSD training effectively alleviates overfitting. The SAR decoding method preserves the superior performance of AR decoding while benefiting from the overfitting suppression of NAR decoding.
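
To make the distinction between AR and masked NAR decoding concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes an additive joint network, a stateless predictor that simply returns the embedding of the last emitted token, masking implemented as zeroing the predictor output, and at most one emitted symbol per frame. All names (`joint`, `greedy_decode`, `rand_matrix`, `mask_predictor`) are hypothetical, and the weights are random rather than trained. The SAR combination rule is not specified in the abstract and is not sketched here.

```python
import math
import random


def joint(enc_t, pred_u, w_out, mask_predictor=False):
    """Toy additive joint: logits = w_out @ tanh(enc_t + pred_u).

    Assumption: "masking" the predictor means replacing its output with zeros.
    """
    if mask_predictor:
        pred_u = [0.0] * len(pred_u)
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def greedy_decode(enc, embed, w_out, blank=0, mask_predictor=False):
    """Greedy decoding, simplified to at most one symbol per encoder frame."""
    hyp = []
    pred_u = [0.0] * len(enc[0])  # predictor output before any token is emitted
    for enc_t in enc:
        logits = joint(enc_t, pred_u, w_out, mask_predictor)
        k = max(range(len(logits)), key=logits.__getitem__)
        if k != blank:
            hyp.append(k)
            pred_u = embed[k]  # hypothetical stateless predictor
    return hyp


def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix as a list of rows (stand-in for trained weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
enc = rand_matrix(rng, 8, 4)      # 8 frames of 4-dim encoder output
w_out = rand_matrix(rng, 5, 4)    # 5 output symbols, index 0 is blank
embed_a = rand_matrix(rng, 5, 4)  # two different predictors
embed_b = rand_matrix(rng, 5, 4)

ar_hyp = greedy_decode(enc, embed_a, w_out)  # AR: depends on past tokens
nar_a = greedy_decode(enc, embed_a, w_out, mask_predictor=True)
nar_b = greedy_decode(enc, embed_b, w_out, mask_predictor=True)
print(nar_a == nar_b)  # True: masked decoding ignores the predictor entirely
```

In this sketch the masked path produces the same hypothesis for any predictor, which is the sense in which it is non-autoregressive: each frame's output depends only on the encoder.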

@article{xi2025_2505.24820,
  title={Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding},
  author={Yu Xi and Xiaoyu Gu and Haoyu Li and Jun Song and Bo Zheng and Kai Yu},
  journal={arXiv preprint arXiv:2505.24820},
  year={2025}
}