Instruction tuning has emerged as a way to enhance the ability of large language
models (LLMs) to comprehend instructions and generate appropriate responses.
Existing methods either rely on manual annotation or employ LLMs (e.g., the GPT
series) to generate data for instruction tuning. However, they often overlook
the option of associating instructions with existing annotated datasets. In this paper, we propose
Dynosaur, a dynamic growth paradigm for the automatic curation of
instruction-tuning data. Based on the metadata of existing datasets, we use
LLMs to automatically construct instruction-tuning data by identifying relevant
data fields and generating appropriate instructions.
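As a rough illustration of this metadata-driven step, the hedged Python sketch below prompts an LLM with a hypothetical dataset's metadata (name, description, field names) and asks it to propose an instruction together with an input/output field mapping. The dataset, prompt wording, and JSON schema are illustrative assumptions, not the exact templates used by Dynosaur; see the paper and repository for the actual prompts.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical metadata for one existing annotated dataset; the real pipeline
# draws such metadata from already-published datasets rather than inventing it.
metadata = {
    "dataset_name": "ag_news",
    "description": "News articles labeled with one of four topic categories.",
    "fields": ["text", "label"],
}

# Illustrative prompt: ask the LLM which fields serve as input/output and
# what instruction connects them.
prompt = (
    "Given the dataset metadata below, propose an instruction-tuning task.\n"
    f"Metadata: {json.dumps(metadata)}\n"
    'Respond in JSON with keys "instruction", "input_fields", "output_field".'
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
# Assumes the model returns valid JSON; production code would validate this.
task = json.loads(response.choices[0].message.content)

def to_instruction_example(record: dict) -> dict:
    """Apply the generated task description to one annotated record, turning
    a (text, label) pair into an (instruction, input, output) triple."""
    return {
        "instruction": task["instruction"],
        "input": " ".join(str(record[f]) for f in task["input_fields"]),
        "output": str(record[task["output_field"]]),
    }

# e.g., instruction_data = [to_instruction_example(r) for r in dataset_records]
```

Because the LLM only has to describe the task once per dataset rather than generate every example, the per-sample API cost stays low.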
By leveraging existing annotated datasets, Dynosaur offers several
advantages: 1) it reduces the API cost of generating instructions (e.g., it
costs less than $12 to call GPT-3.5-turbo to generate 800K instruction-tuning samples); 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data whenever a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting but also improves generalization to unseen tasks. Code and data are available at https://github.com/WadeYin9712/Dynosaur.
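The replay step can be pictured with a small sketch as well: the snippet below selects a replay set of previously seen tasks by greedy farthest-point sampling over instruction embeddings, which is one simple way to realize "replaying tasks with diverse instruction embeddings". The embedding model and the selection rule are assumptions made for illustration and may differ from the scheme used in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backbone

def select_diverse_replay_tasks(instructions: list[str], k: int) -> list[int]:
    """Greedily pick k task indices whose instruction embeddings are maximally
    spread out (farthest-point sampling). This is an illustrative stand-in for
    the paper's diversity-based replay selection."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(instructions, normalize_embeddings=True)

    chosen = [0]  # seed with an arbitrary task
    # Distance from every task to its nearest already-chosen task.
    dist = np.linalg.norm(emb - emb[0], axis=1)
    while len(chosen) < min(k, len(instructions)):
        nxt = int(np.argmax(dist))  # farthest task from the current replay set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return chosen

# Example: keep 2 of these previously seen tasks in the replay buffer.
old_task_instructions = [
    "Classify the topic of the news article.",
    "Summarize the given paragraph in one sentence.",
    "Translate the sentence from English to French.",
]
replay_ids = select_diverse_replay_tasks(old_task_instructions, k=2)
```

Mixing such a diverse replay subset into training on newly arrived instruction-tuning data is what mitigates forgetting while still benefiting from the growing dataset.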