Don't Use Large Mini-Batches, Use Local SGD

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well. Local SGD can offer the same communication vs. computation pattern as mini-batch SGD---and is thus as efficient as mini-batch SGD from a systems perspective---but instead of performing a single large-batch update in each round, it performs several local parameter updates sequentially. We extensively study the communication efficiency vs. performance trade-offs associated with local SGD and provide a new variant, called \emph{post-local SGD}. We show that it significantly improves generalization performance compared to large-batch training and converges to flatter minima.
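
To make the communication pattern concrete, here is a minimal sketch of the local SGD round structure described above. It is not the paper's implementation: the toy least-squares objective, the NumPy setup, and all names (`local_sgd_round`, `LOCAL_STEPS`, the halfway switch point for the post-local variant) are illustrative assumptions.

```python
import numpy as np

# Illustrative toy setup (not from the paper): a random least-squares
# problem split across workers, optimized with plain NumPy.
rng = np.random.default_rng(0)
DIM, WORKERS, LOCAL_STEPS, ROUNDS, LR = 10, 4, 8, 50, 0.05

A = [rng.normal(size=(32, DIM)) for _ in range(WORKERS)]  # worker k's data
b = [rng.normal(size=32) for _ in range(WORKERS)]

def local_grad(w, k, idx):
    """Stochastic gradient on worker k from a sampled mini-batch `idx`."""
    Ak, bk = A[k][idx], b[k][idx]
    return Ak.T @ (Ak @ w - bk) / len(idx)

def local_sgd_round(w_global, local_steps):
    """One communication round: each worker runs `local_steps` sequential
    SGD updates from the shared model, then the models are averaged
    (a single all-reduce per round, as in one large-batch update)."""
    local_models = []
    for k in range(WORKERS):
        w = w_global.copy()
        for _ in range(local_steps):
            idx = rng.choice(32, size=8, replace=False)
            w -= LR * local_grad(w, k, idx)
        local_models.append(w)
    return np.mean(local_models, axis=0)

# Post-local SGD (sketch): behave like mini-batch SGD (one local step per
# round) in a first phase, then switch to several local steps per round.
# The switch point here is an arbitrary assumption for illustration.
w = np.zeros(DIM)
for r in range(ROUNDS):
    steps = 1 if r < ROUNDS // 2 else LOCAL_STEPS
    w = local_sgd_round(w, steps)
```

With `local_steps = 1` each round reduces to a synchronous mini-batch update over all workers, so the sketch communicates exactly as often as large-batch SGD while allowing more local progress per round when `local_steps > 1`.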