arXiv:2309.08131

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

15 September 2023
Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li
Abstract

Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with ⟨cc⟩ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability to text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of ⟨cc⟩ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the ⟨cc⟩ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single- and multi-talker datasets through text-only adaptation.
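
The abstract describes the key mechanism only at a high level: the LM factored out of the transducer's predictor keeps one hidden state per virtual speaker channel and treats ⟨cc⟩ as a channel-switch signal rather than an ordinary token. The toy PyTorch sketch below illustrates that idea under stated assumptions; the class name MultiStateLM, the CC_TOKEN vocabulary id, the LSTM-cell LM, and the two-channel setting are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

CC_TOKEN = 0  # hypothetical vocabulary id reserved for the <cc> (channel change) symbol


class MultiStateLM(nn.Module):
    """Toy LM that keeps one hidden state per virtual speaker channel and
    switches the active channel whenever a <cc> token appears in the
    serialized t-SOT stream (illustrative sketch, not the paper's model)."""

    def __init__(self, vocab_size: int, hidden_size: int = 256, num_channels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.proj = nn.Linear(hidden_size, vocab_size)
        self.hidden_size = hidden_size
        self.num_channels = num_channels

    def init_states(self, batch_size: int):
        # One (h, c) pair per virtual speaker channel.
        return [
            (torch.zeros(batch_size, self.hidden_size),
             torch.zeros(batch_size, self.hidden_size))
            for _ in range(self.num_channels)
        ]

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, time) serialized t-SOT token stream."""
        batch, time = tokens.shape
        states = self.init_states(batch)
        active = 0  # index of the channel whose hidden state is advanced next
        logits = []
        for t in range(time):
            tok = tokens[:, t]
            # For clarity, this toy assumes the whole batch emits <cc> at the same step.
            if bool((tok == CC_TOKEN).all()):
                # <cc> is not fed to the channel LM; it only switches channels.
                active = (active + 1) % self.num_channels
                logits.append(self.proj(states[active][0]))
                continue
            h, c = self.cell(self.embed(tok), states[active])
            states[active] = (h, c)
            logits.append(self.proj(h))
        return torch.stack(logits, dim=1)  # (batch, time, vocab)


# Usage: score a short stream with two interleaved speakers (batch size 1).
lm = MultiStateLM(vocab_size=100)
stream = torch.tensor([[5, 7, CC_TOKEN, 9, CC_TOKEN, 8, 2]])
print(lm(stream).shape)  # torch.Size([1, 7, 100])
```

In this sketch the ⟨cc⟩ token is never embedded or scored by the channel LM; it only toggles which hidden state is advanced. That is one simple way to keep each channel's token sequence natural-looking text, which is what makes text-only LM adaptation possible in the factorized setup the abstract describes.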
