
"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

Abstract

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Inspired by the neuroscience finding that humans use separate perception and cognitive systems to process different kinds of information, we propose LUT (Listen-Understand-Translate), a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the translation loss on the target-language sentence, LUT includes two auxiliary supervision signals: one guides the acoustic encoder to extract acoustic features from the input, and the other guides the semantic encoder to extract semantic features relevant to the source transcription text. We run experiments on English-French, English-German and English-Chinese speech translation benchmarks, and the results support the soundness of the LUT design. Our code is available at https://github.com/dqqcasia/st.
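
As a rough illustration of what "triple supervision" could look like in a training objective, the sketch below sums the translation loss with the two auxiliary losses. The abstract does not say how the three signals are combined, so the weighted sum, the `TripleLoss` container, the `combine_losses` function and both weight parameters are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class TripleLoss:
    """Hypothetical container for the three supervision signals."""
    translation: float  # loss on the target-language sentence
    acoustic: float     # auxiliary loss guiding the acoustic encoder
    semantic: float     # auxiliary loss guiding the semantic encoder


def combine_losses(
    losses: TripleLoss,
    acoustic_weight: float = 1.0,   # assumed weight, not from the paper
    semantic_weight: float = 1.0,   # assumed weight, not from the paper
) -> float:
    """Return a weighted sum of the translation loss and the two auxiliary losses."""
    return (
        losses.translation
        + acoustic_weight * losses.acoustic
        + semantic_weight * losses.semantic
    )


if __name__ == "__main__":
    step = TripleLoss(translation=2.0, acoustic=0.5, semantic=0.25)
    print(combine_losses(step))
```

Setting both weights to zero in this sketch reduces the objective to the translation loss alone, which is the plain end-to-end setting that the auxiliary signals are meant to improve on.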
