Instruction Fine-Tuning: Does Prompt Loss Matter?

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

24 January 2024

Mathew Huerta-Enochian


Main: 9 Pages

11 Figures

Bibliography: 2 Pages

6 Tables

Appendix: 14 Pages

Abstract

We present a study analyzing the effects of prompt loss weighting (PLW) on supervised instruction fine-tuning. We recreated Stanford's Alpaca experiment with both LLaMA 1 and LLaMA 2 and multiple instruction datasets. We found that the performance of models fine-tuned on our short-completion dataset had a statistically significant negative quadratic relationship with PLW, but the performance of models fine-tuned on medium- and long-completion data did not show any relationship with PLW; that is, prompt loss can be safely ignored for many datasets. For short-completion data, small values (0.01-0.1) of PLW were optimal for multiple-choice and short-generation tasks, while large values (~1.0) of PLW were optimal for long-generation tasks. We concluded that low non-zero PLW encourages models to not diverge from pre-trained model weights during training and that high PLW reduces overfitting. Finally, we present a rough guide for selecting PLW values based on the completion-prompt length ratio of fine-tuning data.
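
To make the role of PLW concrete, the following is a minimal sketch of how a prompt loss weight could enter a token-level training loss, together with the completion-prompt length ratio mentioned above. It is an illustration under stated assumptions, not the paper's implementation: the function names `plw_loss` and `completion_prompt_ratio` are hypothetical, per-token losses are assumed to be already computed, and normalizing by the sum of the weights is one possible choice among several.

```python
def plw_loss(token_losses, is_prompt, plw):
    """Weighted mean of per-token losses.

    Prompt tokens receive weight `plw`; completion tokens receive weight 1.0.
    Assumption: the loss is normalized by the sum of the weights.
    """
    if len(token_losses) != len(is_prompt):
        raise ValueError("token_losses and is_prompt must have the same length")
    weights = [plw if prompt else 1.0 for prompt in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all token weights are zero; nothing to average")
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / total_weight


def completion_prompt_ratio(is_prompt):
    """Number of completion tokens divided by number of prompt tokens."""
    n_prompt = sum(1 for prompt in is_prompt if prompt)
    n_completion = len(is_prompt) - n_prompt
    if n_prompt == 0:
        raise ValueError("sequence has no prompt tokens")
    return n_completion / n_prompt


# Toy sequence: two prompt tokens followed by two completion tokens.
losses = [2.0, 2.0, 1.0, 1.0]
mask = [True, True, False, False]

masked = plw_loss(losses, mask, plw=0.0)   # prompt loss ignored entirely
small = plw_loss(losses, mask, plw=0.1)    # prompt tokens contribute slightly
full = plw_loss(losses, mask, plw=1.0)     # prompt and completion weighted equally
ratio = completion_prompt_ratio(mask)
```

In this toy sequence, `plw=0.0` reduces to the usual prompt-masked loss (the mean over completion tokens only), while `plw=1.0` treats every token equally. Values in between interpolate between the two, which is the range the study explores.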
