arXiv: 2507.07101
Versions: v1, v2 (latest)
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
9 July 2025
Martin Marek
Sanae Lotfi
Aditya Somasundaram
A. Wilson
Micah Goldblum
LRM
ArXiv (abs)
PDF
HTML
HuggingFace (2 upvotes)
Papers citing
"Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful"
Pre-training under infinite compute
Konwoo Kim
Suhas Kotha
Percy Liang
Tatsunori Hashimoto
18 Sep 2025