Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

17 October 2024
Aleksandar Armacki
Shuhua Yu
Pranay Sharma
Gauri Joshi
Dragana Bajović
Dušan Jakovetić
Soummya Kar
Abstract

We study high-probability convergence in online learning in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities such as sign, quantization, and component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of the gradient norm squared at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optimum at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, the nonlinearity, and the noise. Compared to the state of the art, which considers only clipping and requires unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in the state of the art depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
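
As an illustration of the black-box treatment of the nonlinearity described in the abstract, the following is a minimal sketch of a nonlinear SGD loop with pluggable nonlinearities (sign, component-wise clipping, joint clipping). It is not the authors' implementation; the function names, the polynomially decaying step-size schedule, and the heavy-tailed test problem are assumptions made here for illustration only.

import numpy as np

def sign_nonlinearity(g):
    # Component-wise sign of the stochastic gradient.
    return np.sign(g)

def joint_clip(g, tau=1.0):
    # Joint (norm) clipping: rescale g so its Euclidean norm is at most tau.
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def component_clip(g, tau=1.0):
    # Component-wise clipping to the interval [-tau, tau].
    return np.clip(g, -tau, tau)

def nonlinear_sgd(grad_fn, x0, nonlinearity, steps=1000, a=1.0, delta=0.75):
    # grad_fn(x, t) returns a (possibly heavy-tailed) stochastic gradient at x.
    # The step size a / (t + 1)**delta is an assumed schedule; the schedules
    # analyzed in the paper may differ.
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        g = grad_fn(x, t)
        x = x - (a / (t + 1) ** delta) * nonlinearity(g)
    return x

# Example: strongly convex quadratic with heavy-tailed (Student-t) gradient noise.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad_fn = lambda x, t: 2.0 * x + rng.standard_t(df=1.5, size=x.shape)
    x_hat = nonlinear_sgd(grad_fn, x0=np.ones(10), nonlinearity=component_clip)
    print("distance to optimum:", np.linalg.norm(x_hat))

Swapping nonlinearity=sign_nonlinearity or nonlinearity=joint_clip changes only the black-box map applied to the gradient, mirroring the unified framework in which clipping is just one admissible choice.
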

@article{armacki2025_2410.13954,
  title={Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees},
  author={Aleksandar Armacki and Shuhua Yu and Pranay Sharma and Gauri Joshi and Dragana Bajovic and Dusan Jakovetic and Soummya Kar},
  journal={arXiv preprint arXiv:2410.13954},
  year={2025}
}