We study gradient descent (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $\eta$. We show that after at most $1/\gamma^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$, where $\gamma$ is the margin of the dataset. As $\eta$ can be arbitrarily large, GD attains an arbitrarily small risk immediately after the burn-in steps, though the risk evolution may be non-monotonic.

We further construct hard datasets with margin $\gamma$, where any batch (or online) first-order method requires $\Omega(1/\gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is minimax optimal among first-order batch methods. Notably, the classical Perceptron (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, matching GD even in constants.

Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
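To make the adaptive-stepsize scheme concrete, here is a minimal numerical sketch (not the authors' code). It assumes the rule $\eta_t = \eta / L(w_t)$, i.e. the constant hyperparameter $\eta$ divided by the current empirical logistic risk, as one natural instantiation of "stepsizes that adapt to the current risk"; the exact rule analyzed in the paper may differ, and the data-generating setup below is purely illustrative.

```python
# Minimal sketch of GD for logistic regression with risk-adaptive stepsizes.
# NOT the authors' implementation: the rule eta_t = eta / L(w_t) is an assumed
# instantiation of "stepsizes that adapt to the current risk, scaled by a
# constant hyperparameter eta"; the rule analyzed in the paper may differ.
import numpy as np


def logistic_risk(w, X, y):
    """Empirical logistic risk L(w) = mean_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + exp(-m))


def risk_gradient(w, X, y):
    """Gradient of the empirical logistic risk at w."""
    margins = y * (X @ w)
    coeffs = -np.exp(-np.logaddexp(0.0, margins))  # = -sigmoid(-m), computed stably
    return (coeffs * y) @ X / X.shape[0]


def adaptive_gd(X, y, eta=5.0, T=50):
    """GD with the (assumed) adaptive stepsize eta_t = eta / L(w_t)."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        risk = logistic_risk(w, X, y)
        stepsize = eta / max(risk, 1e-12)  # floor guards against underflow of the risk
        w = w - stepsize * risk_gradient(w, X, y)
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 5
    # Synthetic linearly separable data with a positive margin along w_star.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_star)
    X += 0.5 * y[:, None] * w_star  # push every point away from the decision boundary
    w_hat = adaptive_gd(X, y, eta=5.0, T=50)
    print("final risk:", logistic_risk(w_hat, X, y))
    print("training accuracy:", np.mean(np.sign(X @ w_hat) == y))
```

With a large $\eta$, a run like this typically reaches a very small risk within a few dozen steps, and the per-step risk need not decrease monotonically, consistent with the behavior described in the abstract.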
@article{zhang2025_2504.04105,
  title   = {Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes},
  author  = {Ruiqi Zhang and Jingfeng Wu and Licong Lin and Peter L. Bartlett},
  journal = {arXiv preprint arXiv:2504.04105},
  year    = {2025}
}