
CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory

Abstract

Large Language Models (LLMs) have advanced significantly, but increasingly complex tasks and higher performance demands call for continuous improvement. Some approaches train models on synthetic data that advanced LLMs generate based on evaluation results. However, conventional evaluation methods fail to provide fine-grained profiles of LLMs, which limits how well they can guide data synthesis. In this paper, we introduce the Cognitive Diagnostic Synthesis (CDS) method, which incorporates a diagnostic process inspired by Cognitive Diagnosis Theory (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, with gains of up to 6.00% in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available on GitHub.
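
To make the idea of a profile "at the knowledge component level" concrete, here is a minimal sketch. It is not the CDS implementation: the function names, the tagging of each evaluation item with knowledge components, the per-component accuracy measure, and the 0.6 threshold are all assumptions chosen for illustration. It only shows how item-level evaluation results could be aggregated into a per-component profile from which weak components are selected as targets for data synthesis.

```python
from collections import defaultdict


def diagnose(results):
    """Aggregate item-level results into per-knowledge-component accuracy.

    `results` is an iterable of (knowledge_components, is_correct) pairs.
    The tagging of items with knowledge components is assumed to exist already.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for components, is_correct in results:
        for kc in components:
            totals[kc] += 1
            correct[kc] += int(is_correct)
    return {kc: correct[kc] / totals[kc] for kc in totals}


def weakest_components(profile, threshold=0.6):
    """Return components whose accuracy is below `threshold`, weakest first.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    weak = [kc for kc, acc in profile.items() if acc < threshold]
    return sorted(weak, key=lambda kc: profile[kc])


# Hypothetical evaluation results: (knowledge components, answered correctly?)
results = [
    (["recursion", "string parsing"], False),
    (["recursion"], False),
    (["string parsing"], True),
    (["sorting"], True),
]

profile = diagnose(results)
targets = weakest_components(profile)
print(profile)
print(targets)
```

In this sketch, the components returned by `weakest_components` would be the ones passed to a data generator as synthesis targets; how the paper's two diagnosis-synthesis strategies actually use the diagnostics is not specified in the abstract.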

@article{zhao2025_2501.07674,
  title={CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory},
  author={Haokun Zhao and Jinyi Han and Jiaqing Liang and Yanghua Xiao},
  journal={arXiv preprint arXiv:2501.07674},
  year={2025}
}