v1v2 (latest)

D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

2 August 2025

Main:8 Pages

3 Figures

Bibliography:2 Pages

6 Tables

Appendix:8 Pages

Abstract

The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce $\textbf{D-SCoRE}$ , a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating $\textbf{D}$ ocument-centric processing, $\textbf{S}$ egmentation, $\textbf{Co}$ T $\textbf{R}$ easoning, and structured $\textbf{E}$ xport - along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation - D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.

View on arXiv

Comments on this paper