
Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness

International Conference on Learning Representations (ICLR), 2025
Main: 10 pages · 1 figure · 2 tables · Bibliography: 3 pages · Appendix: 39 pages
Abstract

Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the $(L_0, L_1)$-smoothness condition, but the rate of convergence is a higher-order polynomial in problem parameters such as the smoothness constants. The complexity guaranteed by such algorithms for finding an $\epsilon$-stationary point may be significantly larger than the optimal complexity of $\Theta\left( \Delta L \sigma^2 \epsilon^{-4} \right)$ achieved by SGD in the $L$-smooth setting, where $\Delta$ is the initial optimality gap and $\sigma^2$ is the variance of the stochastic gradient. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the $(L_0, L_1)$-smooth setting, with a focus on the dependence on the problem parameters $\Delta, L_0, L_1$. We provide complexity bounds for three variants of AdaGrad, which show at least a quadratic dependence on the problem parameters $\Delta, L_0, L_1$. Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least $\Omega\left( \Delta^2 L_1^2 \sigma^2 \epsilon^{-4} \right)$ stochastic gradient queries to find an $\epsilon$-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the $(L_0, L_1)$-smooth setting is fundamentally more difficult than the standard smooth setting in terms of the initial optimality gap and the smoothness constants.
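For reference, a commonly used form of the $(L_0, L_1)$-smoothness condition (for twice-differentiable $f$) bounds the Hessian norm by the gradient norm,
\[
  \|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x,
\]
which recovers standard $L$-smoothness when $L_1 = 0$; the AdaGrad-Norm update with stochastic gradients $g_t$ and stepsize $\eta$ is typically written as
\[
  b_t^2 = b_{t-1}^2 + \|g_t\|^2, \qquad x_{t+1} = x_t - \frac{\eta}{b_t}\, g_t .
\]
These are the standard formulations from the relaxed-smoothness literature and may differ in details (e.g., constants or the decorrelated variant) from the exact definitions used in the paper.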
