Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, in-domain monolingual target-side corpora are also available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performance that surpasses standard TM-based models. We then showcase our method in a real-world setup, where the target monolingual resources far exceed the amount of parallel data, and observe large improvements from our new techniques, which outperform both the baseline setting and general-purpose cross-lingual retrievers.
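To make the retrieval setting concrete, the sketch below illustrates how a cross-lingual retriever combining sentence-level and word-level matching might score target-language segments against a source-side query. This is a minimal illustration, not the paper's implementation: the function names (`retrieve`, `sentence_score`, `word_score`), the interpolation weight `alpha`, and the random placeholder embeddings (standing in for a trained cross-lingual encoder) are all assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two pooled sentence embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sentence_score(src_vec, tgt_vec):
    # Sentence-level match: similarity of source query and target segment embeddings.
    return cosine(src_vec, tgt_vec)

def word_score(src_tok_vecs, tgt_tok_vecs):
    # Word-level match: for each source token, keep its best-matching target
    # token and average the scores (a MaxSim-style aggregation).
    sims = src_tok_vecs @ tgt_tok_vecs.T
    sims /= (np.linalg.norm(src_tok_vecs, axis=1, keepdims=True)
             * np.linalg.norm(tgt_tok_vecs, axis=1) + 1e-9)
    return float(sims.max(axis=1).mean())

def retrieve(src_sent_vec, src_tok_vecs, index, alpha=0.5, k=2):
    # Score every target-side monolingual segment against the source query
    # and return the top-k candidates for the RANMT decoder to condition on.
    scored = []
    for seg_id, (tgt_sent_vec, tgt_tok_vecs) in index.items():
        score = (alpha * sentence_score(src_sent_vec, tgt_sent_vec)
                 + (1 - alpha) * word_score(src_tok_vecs, tgt_tok_vecs))
        scored.append((score, seg_id))
    return sorted(scored, reverse=True)[:k]

# Placeholder embeddings standing in for a trained cross-lingual encoder.
rng = np.random.default_rng(0)
dim = 8
monolingual_index = {
    f"tgt_{i}": (rng.normal(size=dim), rng.normal(size=(5, dim)))
    for i in range(100)
}
query_sent = rng.normal(size=dim)
query_toks = rng.normal(size=(6, dim))
print(retrieve(query_sent, query_toks, monolingual_index))
```

The key point is that both scores are computed cross-lingually, so the index can be built from target-side monolingual data alone rather than from aligned TM pairs.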
@article{bouthors2025_2504.21747,
  title   = {Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data},
  author  = {Maxime Bouthors and Josep Crego and François Yvon},
  journal = {arXiv preprint arXiv:2504.21747},
  year    = {2025}
}