
Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

Abstract

We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton (\texttt{Mb-SVRN}), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n \gg \alpha^2\kappa$, where $\kappa$ is the condition number and $\alpha$ is the Hessian approximation factor, then \texttt{Mb-SVRN} achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long as $b$ is in the range between $1$ and $b_{\max} = O(n/(\alpha \log n))$. Only after increasing the mini-batch size past this critical point $b_{\max}$ does the method begin to transition into a standard Newton-type algorithm, which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks, showing that, after tuning the step size, the convergence rate of \texttt{Mb-SVRN} remains fast for a wide range of mini-batch sizes, and that the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $\alpha$ aligns with our theoretical predictions.
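The abstract does not include pseudocode, but the described combination of SVRG-style variance-reduced gradients with an approximate Hessian oracle can be illustrated with a minimal Python sketch. The names (`mb_svrn_sketch`, `grad_i`, `hess_approx`), the inner/outer loop schedule, and the fixed step size are assumptions made for illustration only; the paper's exact algorithm and parameter choices may differ.

```python
import numpy as np

def mb_svrn_sketch(grad_i, hess_approx, x0, n, b, eta=1.0,
                   outer_iters=20, inner_iters=None, rng=None):
    """Sketch of a variance-reduced Newton-type method (Mb-SVRN style).

    grad_i(x, idx):    average gradient over components in `idx` at x.
    hess_approx(x):    approximate Hessian at x (e.g., via subsampling).
    This is an illustrative sketch, not the paper's exact algorithm.
    """
    rng = np.random.default_rng() if rng is None else rng
    inner_iters = inner_iters or max(1, n // b)
    x = x0.copy()
    all_idx = np.arange(n)
    for _ in range(outer_iters):
        x_snap = x.copy()
        full_grad = grad_i(x_snap, all_idx)           # full gradient at the snapshot
        H = hess_approx(x_snap)                       # approximate Hessian oracle
        for _ in range(inner_iters):
            S = rng.choice(n, size=b, replace=False)  # gradient mini-batch of size b
            # SVRG-style variance-reduced gradient estimate
            g = grad_i(x, S) - grad_i(x_snap, S) + full_grad
            # Newton-type step preconditioned by the approximate Hessian
            x = x - eta * np.linalg.solve(H, g)
    return x

# Hypothetical usage on regularized least squares (for illustration only):
# f(x) = (1/n) * sum_i 0.5*(a_i^T x - y_i)^2 + (lam/2)*||x||^2
rng = np.random.default_rng(0)
n, d, lam = 2000, 50, 1e-3
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_i = lambda x, idx: A[idx].T @ (A[idx] @ x - y[idx]) / len(idx) + lam * x
hess_approx = lambda x: A.T @ A / n + lam * np.eye(d)  # exact Hessian here, for simplicity
x_hat = mb_svrn_sketch(grad_i, hess_approx, np.zeros(d), n, b=64, rng=rng)
```

In this sketch the mini-batch size `b` only affects the variance-reduced gradient estimate, while the Hessian is formed once per outer iteration at the snapshot point, which is the structural feature the abstract's robustness claim concerns.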
