Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet, in many settings, in-domain monolingual target-side corpora are also available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. For this, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performance that surpasses standard TM-based models. We then showcase our method in a real-world setup, where the target monolingual resources far exceed the amount of parallel data, and observe large improvements from our new techniques, which outperform both the baseline setting and general-purpose cross-lingual retrievers.
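To make the retrieval setting concrete, the sketch below illustrates how a cross-lingual retriever combining sentence-level and word-level matching might score target-language segments against a source-side query. This is a minimal illustration, not the paper's implementation: the function names (`retrieve`, `sentence_score`, `word_score`), the interpolation weight `alpha`, and the random placeholder embeddings (standing in for a trained cross-lingual encoder) are all assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two pooled sentence embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def sentence_score(src_vec, tgt_vec):
    # Sentence-level match: similarity of source query and target segment embeddings.
    return cosine(src_vec, tgt_vec)

def word_score(src_tok_vecs, tgt_tok_vecs):
    # Word-level match: for each source token, keep its best-matching target
    # token and average the scores (a MaxSim-style aggregation).
    sims = src_tok_vecs @ tgt_tok_vecs.T
    sims /= (np.linalg.norm(src_tok_vecs, axis=1, keepdims=True)
             * np.linalg.norm(tgt_tok_vecs, axis=1) + 1e-9)
    return float(sims.max(axis=1).mean())

def retrieve(src_sent_vec, src_tok_vecs, index, alpha=0.5, k=2):
    # Score every target-side monolingual segment against the source query
    # and return the top-k candidates for the RANMT decoder to condition on.
    scored = []
    for seg_id, (tgt_sent_vec, tgt_tok_vecs) in index.items():
        score = (alpha * sentence_score(src_sent_vec, tgt_sent_vec)
                 + (1 - alpha) * word_score(src_tok_vecs, tgt_tok_vecs))
        scored.append((score, seg_id))
    return sorted(scored, reverse=True)[:k]

# Placeholder embeddings standing in for a trained cross-lingual encoder.
rng = np.random.default_rng(0)
dim = 8
monolingual_index = {
    f"tgt_{i}": (rng.normal(size=dim), rng.normal(size=(5, dim)))
    for i in range(100)
}
query_sent = rng.normal(size=dim)
query_toks = rng.normal(size=(6, dim))
print(retrieve(query_sent, query_toks, monolingual_index))
```

The key point is that both scores are computed cross-lingually, so the index can be built from target-side monolingual data alone rather than from aligned TM pairs.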
@article{bouthors2025_2504.21747,
  title   = {Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data},
  author  = {Maxime Bouthors and Josep Crego and François Yvon},
  journal = {arXiv preprint arXiv:2504.21747},
  year    = {2025}
}