You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

31 March 2025

Abstract

The goal of translation, whether by human or by machine, is to take text in a source language and produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture both the semantic accuracy and the naturalness of the output. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system's true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system's naturalness, while "overfitting" the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than being compared using a single number, systems should be compared on an accuracy-naturalness plane.
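As a rough illustration of comparing systems on an accuracy-naturalness plane rather than by one number, the sketch below finds the systems that no other system beats on both axes (a Pareto front). It is not the paper's method: the function name `pareto_front`, the system names, and all scores are hypothetical, and it assumes that higher is better on both axes.

```python
from typing import Dict, List, Tuple


def pareto_front(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the systems that are not dominated on (accuracy, naturalness).

    A system is dominated if some other system is at least as good on both
    axes and strictly better on at least one. Higher is assumed to be better.
    """
    front = []
    for name, (acc, nat) in scores.items():
        dominated = any(
            (a >= acc and n >= nat) and (a > acc or n > nat)
            for other, (a, n) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (accuracy, naturalness) scores, for illustration only.
systems = {
    "sys_a": (0.90, 0.60),
    "sys_b": (0.80, 0.80),
    "sys_c": (0.60, 0.90),
    "sys_d": (0.70, 0.70),
}

front = pareto_front(systems)
print(front)
```

With these made-up numbers, `sys_d` is dominated by `sys_b`, while the other three each represent a different balance between the two axes, which a single combined score would collapse into one ranking.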


@article{flamich2025_2503.24013,
  title={You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation},
  author={Gergely Flamich and David Vilar and Jan-Thorsten Peter and Markus Freitag},
  journal={arXiv preprint arXiv:2503.24013},
  year={2025}
}
