
Attention-Based End-to-End Speech Recognition on Voice Search

Yujun Wang
Lei Xie
Abstract

Recently, there has been increasing interest in end-to-end speech recognition, which directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. We propose a smoothing method for the attention mechanism and compare it with content-based attention and convolutional attention. Moreover, frame skipping is employed to speed up training and convergence. On the XiaoMi TV voice search dataset, we achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% without using any lexicon or language model. When combined with a trigram language model, we reach a CER of 2.81% and an SER of 5.77%.
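The abstract mentions two techniques: attention smoothing and frame skipping. The sketch below illustrates both under stated assumptions; it assumes "smoothing" means replacing the softmax exponential with a logistic sigmoid before normalization (a common formulation in the attention literature, though the paper's exact definition may differ), and the `skip=3` factor is purely illustrative.

```python
import math

def softmax_attention(scores):
    """Standard content-based attention: softmax over alignment scores."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]  # shift by max for numerical stability
    z = sum(e)
    return [x / z for x in e]

def smoothed_attention(scores):
    """Assumed smoothing variant: sigmoid each score, then normalize.
    This yields flatter (less peaky) weights than softmax."""
    s = [1.0 / (1.0 + math.exp(-x)) for x in scores]
    z = sum(s)
    return [x / z for x in s]

def skip_frames(frames, skip=3):
    """Frame skipping: keep every `skip`-th acoustic frame to shorten
    the encoder's input sequence and speed up training."""
    return frames[::skip]
```

For a score vector with one dominant entry, the smoothed weights are noticeably flatter than the softmax weights, which can help the decoder attend over a wider context early in training.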
