ResearchTrend.AI
arXiv:1803.09357 · v2 (latest)
Minimizing Nonconvex Population Risk from Rough Empirical Risk

25 March 2018
Chi Jin
Lydia T. Liu
Rong Ge
Michael I. Jordan
Abstract

Population risk, the expectation of the loss over the sampling mechanism, is always of primary interest in machine learning. However, learning algorithms only have access to empirical risk, which is the average loss over training examples. Although the two risks are typically guaranteed to be pointwise close, for applications with nonconvex nonsmooth losses (such as modern deep networks), the effects of sampling can transform a well-behaved population risk into an empirical risk with a landscape that is problematic for optimization. The empirical risk can be nonsmooth, and it may have many additional local minima. This paper considers a general optimization framework which aims to find approximate local minima of a smooth nonconvex function F (population risk) given only access to the function value of another function f (empirical risk), which is pointwise close to F (i.e., ∥F − f∥_∞ ≤ ν). We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f which is guaranteed to find an ε-second-order stationary point if ν ≤ O(ε^{1.5}/d), thus escaping all saddle points of F and all the additional local minima introduced by f. We also provide an almost matching lower bound showing that our SGD-based approach achieves the optimal trade-off between ν and ε, as well as the optimal dependence on problem dimension d, among all algorithms making a polynomial number of queries. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit, whose empirical risk is nonsmooth.
