Transformer architectures are increasingly effective at processing and generating very long chunks of text, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousand tokens. We design and implement a new approach to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a) translation performance decreases with the length of the input text; (b) the position of sentences within the document matters, and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates these problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.
@article{peng2025_2412.17592,
  title={Investigating Length Issues in Document-level Machine Translation},
  author={Ziqian Peng and Rachel Bawden and François Yvon},
  journal={arXiv preprint arXiv:2412.17592},
  year={2025}
}