ResearchTrend.AI


Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

22 April 2025
Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
Abstract

The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm, Web as Instruction and Web as Response, where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at this https URL.
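The dual-perspective paradigm described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the function names and prompts are assumptions, and `reconstruct` is a stand-in for a real LLM call.

```python
import random

def reconstruct(prompt: str) -> str:
    """Stand-in for an LLM call; a real WebR pipeline would query a model here."""
    return f"[LLM output for: {prompt[:40]}...]"

def synthesize_pair(web_document: str) -> dict:
    """Designate a raw web document as either an instruction or a response,
    then reconstruct the missing half of the instruction-response pair."""
    role = random.choice(["instruction", "response"])
    if role == "instruction":
        # Web as Instruction: treat the document as a latent instruction
        # and ask the model for a matching response.
        instruction = reconstruct(
            f"Rewrite this web document as a clear instruction:\n{web_document}"
        )
        response = reconstruct(f"Respond to the instruction:\n{instruction}")
    else:
        # Web as Response: treat the document as a response and ask the
        # model to infer an instruction it could plausibly answer.
        instruction = reconstruct(
            f"Write an instruction this document answers:\n{web_document}"
        )
        response = web_document
    return {"instruction": instruction, "response": response, "role": role}

pair = synthesize_pair("A guide to brewing pour-over coffee at home.")
print(sorted(pair))
```

Running many documents through such a loop yields the instruction-tuning dataset; the random role assignment is one simple way to realize the paper's "either an instruction or a response" designation.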

@article{jiang2025_2504.15573,
  title={Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction},
  author={Yuxin Jiang and Yufei Wang and Chuhan Wu and Xinyi Dai and Yan Xu and Weinan Gan and Yasheng Wang and Xin Jiang and Lifeng Shang and Ruiming Tang and Wei Wang},
  journal={arXiv preprint arXiv:2504.15573},
  year={2025}
}