On the Regularization Effect of Stochastic Gradient Descent applied to Least Squares

Abstract

We study the behavior of stochastic gradient descent applied to $\|Ax - b\|_2^2 \rightarrow \min$ for invertible $A \in \mathbb{R}^{n \times n}$. We show that there is an explicit constant $c_A$ depending (mildly) on $A$ such that
$$\mathbb{E}\,\left\| Ax_{k+1} - b \right\|_2^2 \leq \left(1 + \frac{c_A}{\|A\|_F^2}\right) \left\| Ax_k - b \right\|_2^2 - \frac{2}{\|A\|_F^2} \left\| A^T A (x_k - x) \right\|_2^2.$$
This is a curious inequality: the last term has one more matrix applied to the residual $x_k - x$ than the remaining terms. If $x_k - x$ is mainly composed of large singular vectors, stochastic gradient descent leads to quick regularization. For symmetric matrices, this inequality extends to higher-order Sobolev spaces. This explains a (known) regularization phenomenon: an energy cascade from large singular values to small singular values has a smoothing effect.
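As an illustration of the mechanism described in the abstract, below is a minimal numerical sketch: SGD on $\|Ax - b\|_2^2$ with rows sampled proportionally to $\|a_i\|_2^2$ and step size $1/\|a_i\|_2^2$, i.e. the randomized Kaczmarz form of the update. The matrix, dimensions, seed, and iteration count are illustrative assumptions, not taken from the paper; the expected outcome is that the error $x_k - x$ decays fastest along the directions of large singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed, not from the paper): a random invertible A.
n = 50
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)            # singular values s are in decreasing order
x_true = rng.standard_normal(n)
b = A @ x_true

row_norms2 = np.sum(A**2, axis=1)
probs = row_norms2 / row_norms2.sum()  # P(row i) = ||a_i||^2 / ||A||_F^2

x = np.zeros(n)
for k in range(2000):
    i = rng.choice(n, p=probs)
    # SGD step on the i-th summand (a_i . x - b_i)^2 with step size
    # 1/||a_i||^2, which is exactly the randomized Kaczmarz update.
    x -= (A[i] @ x - b[i]) / row_norms2[i] * A[i]

# Expand the error x_k - x in the right singular vectors of A: the
# components along large singular values should have decayed fastest.
coeffs = np.abs(Vt @ (x - x_true))
print("error along top-5 singular directions   :", coeffs[:5])
print("error along bottom-5 singular directions:", coeffs[-5:])
```

Since the iteration starts at $x_0 = 0$, the initial error $-x$ is spread roughly evenly across all singular directions; after a moderate number of steps the components along the top singular directions should be much smaller than those along the bottom ones, matching the "energy cascade" picture.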
