A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing that consists of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed with a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure quality. We conducted experiments with methods such as back-translation and EDA, baseline models such as BART and T5, and large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a foundation for future research and applications in Vietnamese paraphrase tasks.
@article{nguyen2025_2502.07188,
  title={A Large-Scale Benchmark for Vietnamese Sentence Paraphrases},
  author={Sang Quang Nguyen and Kiet Van Nguyen},
  journal={arXiv preprint arXiv:2502.07188},
  year={2025}
}