
D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Main:8 Pages
3 Figures
Bibliography:2 Pages
6 Tables
Appendix:8 Pages
Abstract

The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export - along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation - D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.
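To make the segment, prompt, and export stages named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the round-robin question type balancing, the JSON keys (`question`, `cot`, `answer`), and the JSONL export format are all hypothetical, and `fake_llm` is a stand-in for a real model call.

```python
import json
import re
from typing import Callable, Dict, List, Sequence

# Assumed (hypothetical) schema for one generated QA-CoT item.
REQUIRED_KEYS = {"question", "cot", "answer"}


def segment(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into segments of roughly max_chars.

    A single sentence longer than max_chars becomes its own segment.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


def build_prompt(segment_text: str, question_type: str) -> str:
    """Build a prompt for one segment (the wording is an assumption)."""
    return (
        f"Read the passage and write one {question_type} question, "
        "a step-by-step reasoning chain, and a final answer. "
        'Reply as JSON with keys "question", "cot", "answer".\n\n'
        f"Passage:\n{segment_text}"
    )


def generate_records(
    text: str,
    llm: Callable[[str], str],
    question_types: Sequence[str] = ("factual", "inferential"),
    max_chars: int = 200,
) -> List[Dict[str, str]]:
    """Run segment -> prompt -> parse, cycling question types round-robin."""
    records: List[Dict[str, str]] = []
    for i, seg in enumerate(segment(text, max_chars)):
        qtype = question_types[i % len(question_types)]
        raw = llm(build_prompt(seg, qtype))
        try:
            item = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip replies that are not valid JSON
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue  # skip replies missing a required field
        records.append({
            "segment": seg,
            "type": qtype,
            "question": item["question"],
            "cot": item["cot"],
            "answer": item["answer"],
        })
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns a well-formed reply."""
    return json.dumps({
        "question": "What does the passage state?",
        "cot": "Step 1: read the passage. Step 2: restate its claim.",
        "answer": "It states a fact.",
    })
```

In practice `fake_llm` would be replaced by a call to whichever model is being used, and records that fail parsing are dropped rather than repaired, which is a deliberately simple design choice for this sketch.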
