VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation

Abstract

Neural Machine Translation (NMT) driven by Transformer architectures has advanced significantly, yet it still struggles with low-resource language pairs such as Vietnamese-Japanese (Vi-Ja), where parallel data is sparse and linguistic and cultural nuances are hard to capture. Recent progress in Large Language Models (LLMs) with strong reasoning abilities, often refined via Reinforcement Learning (RL), enables high-quality synthetic data generation. We introduce VNJPTranslate, a pipeline designed to systematically address the Vi-Ja translation task. It features a targeted data augmentation strategy that applies advanced LLMs with Chain-of-Thought (CoT) prompting to challenging segments identified via corpus analysis. Subsequently, we employ efficient fine-tuning techniques (Unsloth with QLoRA) on a capable, low-parameter autoregressive model (specifically, a fine-tuned version of the 1.8B-parameter Sailor model, which is based on the Qwen architecture) to create a practical, high-performing translation system. This integrated approach aims to improve Vi-Ja translation quality significantly over existing baselines.
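The augmentation step described above prompts an LLM to reason step by step before translating a difficult Vietnamese segment. A minimal sketch of such a prompt builder is shown below; the function name, instruction wording, and output markers are illustrative assumptions, since the paper's exact prompt template is not reproduced here.

```python
def build_cot_prompt(vi_sentence: str) -> str:
    """Build a Chain-of-Thought translation prompt for a challenging
    Vietnamese segment (illustrative template, not the paper's exact one).

    The prompt asks the LLM to reason about named entities, idioms, and
    Japanese register choices before emitting the final translation, so
    the synthetic target side better handles linguistic/cultural nuance.
    """
    return (
        "Translate the following Vietnamese sentence into Japanese.\n"
        "First, reason step by step: identify named entities and idioms, "
        "and choose an appropriate Japanese register (plain vs. polite). "
        "Then give only the final translation after the marker 'Japanese:'.\n\n"
        f"Vietnamese: {vi_sentence}\n"
        "Reasoning:"
    )


# Example: wrap a difficult source segment before sending it to the LLM.
prompt = build_cot_prompt("Hôm nay trời đẹp quá.")
```

The completion returned by the LLM would then be split on the `Japanese:` marker to recover the clean target sentence for the synthetic parallel corpus.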

@article{phan2025_2504.00339,
  title={VNJPTranslate: A comprehensive pipeline for Vietnamese-Japanese translation},
  author={Hoang Hai Phan and Nguyen Duc Minh Vu and Nam Dang Phuong},
  journal={arXiv preprint arXiv:2504.00339},
  year={2025}
}