Variety and Quality over Quantity: Towards Versatile Instruction Curation
Instruction fine-tuning, which refines pre-trained LLMs on datasets paired with natural-language instructions, is a powerful approach. However, its effectiveness is hindered by redundancy and deficiencies in LLM-generated instruction datasets. In this paper, we introduce a highly effective and versatile paradigm for selecting diverse and high-quality instruction-following data from fine-tuning datasets. We first apply dataset enhancement and expansion to augment the dataset with more diverse and higher-quality data, and then apply variety compression and quality compression sequentially to curate the desired dataset. Our experimental results show that, even with a limited quantity of high-quality instruction data, LLMs consistently maintain robust performance across both natural language understanding and code generation tasks; notably, in certain instances they outperform models trained on significantly larger instruction datasets.
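As a rough illustration of the two-stage compression described above, the following Python sketch first selects a diverse subset via greedy farthest-point selection over instruction embeddings (a generic stand-in for variety compression), then keeps the highest-scoring examples under a quality metric (a stand-in for quality compression). The embedding function, the quality scorer, and all names below are placeholders for this sketch, not the paper's actual components.

```python
"""Minimal sketch of a variety-then-quality curation pipeline (illustrative only)."""
import zlib
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Example:
    instruction: str
    response: str


def variety_compress(pool: List[Example],
                     embed: Callable[[str], np.ndarray],
                     keep: int) -> List[Example]:
    """Greedy max-min selection: repeatedly pick the example least similar to
    anything already selected, so near-duplicate instructions are dropped."""
    X = np.stack([embed(ex.instruction) for ex in pool]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12   # cosine geometry
    selected = [0]                                           # seed with the first example
    while len(selected) < min(keep, len(pool)):
        redundancy = (X @ X[selected].T).max(axis=1)         # similarity to nearest pick
        redundancy[selected] = np.inf                        # never re-pick an example
        selected.append(int(redundancy.argmin()))
    return [pool[i] for i in selected]


def quality_compress(pool: List[Example],
                     score: Callable[[Example], float],
                     budget: int) -> List[Example]:
    """Keep the `budget` highest-scoring examples under a quality metric."""
    return sorted(pool, key=score, reverse=True)[:budget]


def curate(pool, embed, score, variety_keep, budget):
    """Variety compression first, then quality compression, as in the abstract."""
    return quality_compress(variety_compress(pool, embed, variety_keep), score, budget)


def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Hashed bag-of-words embedding; a real pipeline would use a sentence encoder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec


if __name__ == "__main__":
    pool = [Example(f"Explain topic {i % 5} in detail", "..." * (i % 7 + 1))
            for i in range(50)]
    # Response length stands in for a learned quality score in this toy demo.
    curated = curate(pool, toy_embed, lambda ex: len(ex.response),
                     variety_keep=10, budget=5)
    print(len(curated), "examples kept")
```

In practice the embedding model, cluster or subset size, and quality scorer (e.g., a reward model or LLM judge) would be chosen per dataset; the sketch only shows the ordering of the two compression stages, not the paper's specific selection criteria.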