Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts

Abstract

The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation. Fluency defects in generated poems significantly reduce their value. However, training texts are often sourced from internet-based platforms without stringent quality control, which makes it difficult for data engineers to manage defect levels effectively. To address this issue, we propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches, utilizing both synthetic and human-labeled datasets. We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems designed for cross-sentence grammatical error detection, and provide the full evaluation code. Our work aims to empower the community with tools and insights to improve the quality of training datasets for generative models in creative domains.

@article{koziev2025_2505.04507,
  title={Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts},
  author={Ilya Koziev},
  journal={arXiv preprint arXiv:2505.04507},
  year={2025}
}