
One-shot Entropy Minimization

Main: 12 pages
7 figures
Bibliography: 1 page
2 tables
Appendix: 1 page
Abstract

We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at this https URL.
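To make the core idea concrete, the following is a minimal toy sketch of entropy minimization, not the paper's implementation. It assumes a single categorical distribution parameterized directly by logits (a stand-in for one next-token distribution) and takes 10 plain gradient-descent steps on the Shannon entropy. The function names, the learning rate, and the example logits are all hypothetical choices made for illustration; the abstract does not specify the actual loss, optimizer, or hyperparameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad(logits):
    """Gradient of H(softmax(z)) with respect to the logits z.

    For softmax probabilities p, dH/dz_j = -p_j * (log p_j + H).
    """
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]


def minimize_entropy(logits, steps=10, lr=1.0):
    """Run `steps` gradient-descent steps on the entropy (toy setting).

    Returns the updated logits and the entropy before each step plus
    the final entropy. `steps` and `lr` are illustrative assumptions.
    """
    z = list(logits)
    history = [entropy(softmax(z))]
    for _ in range(steps):
        g = entropy_grad(z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        history.append(entropy(softmax(z)))
    return z, history


if __name__ == "__main__":
    # Hypothetical logits for a 4-way distribution; no labels are used.
    final_logits, hist = minimize_entropy([2.0, 1.0, 0.5, 0.0])
    print("entropy before:", hist[0])
    print("entropy after 10 steps:", hist[-1])
```

In this toy setting the update needs no labels or reward: it only sharpens the distribution toward the option the model already prefers. In a real language model the same objective would presumably be averaged over the tokens the model generates and backpropagated into the model weights rather than applied to raw logits.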
