Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition
Automatic speech recognition (ASR) systems degrade significantly under noisy conditions. Recently, speech enhancement (SE) has been introduced as a front-end module to reduce noise and improve speech quality for ASR, but it can also suppress important speech information, known as the over-suppression problem. To alleviate this, we propose a dual-path style learning approach for end-to-end noise-robust automatic speech recognition (DPSL-ASR). Specifically, we first feed the clean speech feature, along with the fused feature from the previously proposed IFF-Net, as dual-path inputs to recover the over-suppressed information. Then, we propose a style learning method that maps the fused feature close to the clean feature, so that it learns latent speech information from the latter, i.e., the clean "speech style". Furthermore, we employ a consistency loss to minimize the distance between the ASR outputs of the two paths, which further improves noise robustness. Experimental results show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline on the RATS Channel-A and CHiME-4 1-Channel Track datasets, respectively. Visualizations of intermediate embeddings indicate that DPSL-ASR recovers abundant over-suppressed information in the enhanced speech. Our code is available at GitHub: https://github.com/YUCHEN005/DPSL-ASR.
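The abstract describes two auxiliary training objectives: a style loss that pulls the fused (enhanced) feature toward the clean feature, and a consistency loss that aligns the ASR outputs of the two paths. A minimal PyTorch sketch of such a combined objective is shown below; the function name, loss choices (MSE for style, symmetric KL for consistency), weights, and tensor shapes are illustrative assumptions, not the paper's exact formulation (see the linked repository for that).

```python
# Hypothetical sketch of a dual-path auxiliary objective: style loss between
# features plus consistency loss between the two paths' ASR outputs.
import torch
import torch.nn.functional as F

def dpsl_losses(fused_feat, clean_feat, logits_fused, logits_clean,
                w_style=1.0, w_consist=1.0):
    """Combine style and consistency losses on a batch.

    fused_feat, clean_feat:     (batch, time, dim)   intermediate features
    logits_fused, logits_clean: (batch, time, vocab) ASR output logits
    """
    # Style learning: map the fused feature close to the clean feature
    # (MSE distance here, as a stand-in for the paper's style loss).
    style_loss = F.mse_loss(fused_feat, clean_feat)

    # Consistency: minimize the distance between the two paths' ASR outputs
    # (a symmetric KL divergence over the output distributions).
    log_p = F.log_softmax(logits_fused, dim=-1)
    log_q = F.log_softmax(logits_clean, dim=-1)
    consist_loss = 0.5 * (
        F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    )
    return w_style * style_loss + w_consist * consist_loss

# Example with random tensors (shapes are arbitrary):
B, T, D, V = 2, 50, 256, 1000
feats = torch.randn(B, T, D)
total = dpsl_losses(feats, feats.clone(),
                    torch.randn(B, T, V), torch.randn(B, T, V))
```

In such a setup both auxiliary terms would be added to the main ASR loss (e.g., CTC or attention cross-entropy), with the weights tuned on a development set.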