
On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

Main: 17 pages, 1 figure; Bibliography: 3 pages
Abstract

Although adaptive gradient methods have been extensively used in deep learning, their convergence rates proved in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{T^{1/4}})$ measured by the $\ell_1$ norm without the bounded-gradient assumption, where $d$ is the dimension of the optimization variable, $T$ is the iteration number, and $C$ is a constant identical to the one appearing in the optimal convergence rate of SGD. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, our convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{T^{1/4}})$ rate of SGD in the ideal case of $\|\nabla f(x)\|_1=\Theta(\sqrt{d}\|\nabla f(x)\|_2)$.
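For readers unfamiliar with the algorithm the abstract analyzes, the following is a minimal sketch of RMSProp with a heavy-ball-style momentum extension. The function name, hyperparameter names, and default values (`lr`, `beta`, `mu`, `eps`) are illustrative assumptions, not the constants or the exact update ordering used in the paper's analysis.

```python
import numpy as np

def rmsprop_momentum(grad, x0, lr=1e-2, beta=0.9, mu=0.9, eps=1e-8, steps=1000):
    """Sketch of RMSProp with momentum.

    grad: callable returning the (stochastic) gradient at x.
    x0:   initial point (array-like).
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)  # exponential moving average of squared gradients
    m = np.zeros_like(x)  # momentum buffer
    for _ in range(steps):
        g = grad(x)
        v = beta * v + (1.0 - beta) * g * g       # adaptive second-moment estimate
        m = mu * m + g / (np.sqrt(v) + eps)       # momentum on the preconditioned gradient
        x = x - lr * m                            # parameter update
    return x
```

On a simple quadratic such as $f(x)=x^2$ (gradient $2x$), the iterates move toward the minimizer and then hover near it, with the per-coordinate scaling $1/(\sqrt{v}+\epsilon)$ making the effective step size roughly gradient-magnitude-free.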
