
How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

International Conference on Learning Representations (ICLR), 2023
Abstract

This paper rigorously shows how over-parameterization changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting, where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result for randomly initialized GD in the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterized case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterized case is especially significant because one can apply the asymmetric parameterization to the symmetric setting, speeding up from $\Omega(1/T^2)$ to linear convergence. On the other hand, we propose a novel method that modifies only one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate of the exact-parameterization case.
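To make the setup concrete, here is a toy NumPy sketch (not from the paper) of randomly initialized GD on the matrix sensing loss with the over-parameterized symmetric parameterization $XX^\top$, $k > r$. All dimensions, the number of measurements $m$, the step size, and the initialization scale $\alpha$ are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k, m = 8, 1, 3, 200  # k > r: the over-parameterized symmetric case

# Rank-r PSD ground truth M* = U U^T, normalized so ||M*||_F = 1
U = rng.standard_normal((n, r))
M_star = U @ U.T
M_star /= np.linalg.norm(M_star)

# Symmetrized Gaussian measurement matrices A_i and measurements y_i = <A_i, M*>
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('mij,ij->m', A, M_star)

def loss_and_grad(X):
    """f(X) = (1/2m) sum_i (<A_i, X X^T> - y_i)^2 and its gradient in X."""
    R = np.einsum('mij,ij->m', A, X @ X.T) - y         # residuals
    loss = 0.5 * np.mean(R ** 2)
    G = (2.0 / m) * np.einsum('m,mij->ij', R, A) @ X   # grad of <A_i, XX^T> is 2 A_i X
    return loss, G

alpha = 1e-3                          # small random initialization scale
X = alpha * rng.standard_normal((n, k))
eta = 0.2                             # step size, tuned by hand for this toy run
for t in range(2000):
    loss, G = loss_and_grad(X)
    X -= eta * G

err = np.linalg.norm(X @ X.T - M_star) / np.linalg.norm(M_star)
```

On a single toy instance like this, the relative error `err` shrinks to a small value; the paper's $\Omega(1/T^2)$ lower bound concerns the asymptotic rate at which the residual from the redundant $k - r$ directions decays, which a short run at one problem size does not by itself exhibit.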
