
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Somshubra Majumdar
Vahid Noroozi
Mehrzad Samadi
Sean Narenthiran
Aleksander Ficek
Wasi Uddin Ahmad
Jocelyn Huang
Jagadeesh Balam
Boris Ginsburg
Abstract

Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks, where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large volumes of high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for instruction generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. The proposed approach is highly parallelizable and remains effective even with small seed data and weaker generator models. Using this approach, we generated more than 7.5 million coding instructions. We then evaluated the data by fine-tuning LLMs on the synthetic samples, demonstrating a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
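The abstract describes a three-role evolutionary loop: parent instructions are sampled from a growing pool, an Instructor-LLM produces a new instruction, a Coder-LLM solves it, and a Judge-LLM decides whether the pair is kept. The Python sketch below illustrates one plausible shape of that loop; the wrappers instructor_llm, coder_llm, and judge_llm are hypothetical stand-ins for the three model calls, and the parent-sampling and pool-growth details are assumptions rather than the paper's exact procedure.

```python
import random

# Hypothetical wrappers around the three LLM roles; any chat-completion
# client could stand in here. These names are illustrative, not from the paper.
def instructor_llm(parents: list[str]) -> str:
    """Evolve the parent instructions into a new, harder or more diverse one."""
    raise NotImplementedError

def coder_llm(instruction: str) -> str:
    """Synthesize a code solution for the given instruction."""
    raise NotImplementedError

def judge_llm(instruction: str, code: str) -> bool:
    """Return True if the instruction-code pair passes quality evaluation."""
    raise NotImplementedError

def genetic_instruct(seeds: list[str], generations: int, pop_size: int) -> list[tuple[str, str]]:
    """One worker's evolutionary loop: sample parents from the pool, generate a
    new instruction, solve it, and keep the pair only if the judge accepts it."""
    pool = list(seeds)
    accepted: list[tuple[str, str]] = []
    for _ in range(generations):
        for _ in range(pop_size):
            parents = random.sample(pool, k=min(2, len(pool)))
            instruction = instructor_llm(parents)
            code = coder_llm(instruction)
            if judge_llm(instruction, code):
                accepted.append((instruction, code))
                pool.append(instruction)  # accepted instructions seed later generations
    return accepted
```

Since each such worker only reads from and appends to its instruction pool, many loops can run concurrently over shared seeds, which is consistent with the parallelizability claim in the abstract.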

@article{majumdar2025_2407.21077,
  title={Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models},
  author={Somshubra Majumdar and Vahid Noroozi and Mehrzad Samadi and Sean Narenthiran and Aleksander Ficek and Wasi Uddin Ahmad and Jocelyn Huang and Jagadeesh Balam and Boris Ginsburg},
  journal={arXiv preprint arXiv:2407.21077},
  year={2025}
}