How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

3 October 2023
Nuoya Xiong
Lijun Ding
Simon S. Du
Abstract

This paper rigorously shows how over-parameterization changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization, where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, with $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result for randomly initialized GD in the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega(1/T^2)$ to linear convergence. On the other hand, we propose a novel method that modifies only one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate of the exact-parameterization case.
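As a rough numerical companion to the abstract, the sketch below (not the authors' code; the dimensions, measurement model, step size, and initialization scale are illustrative assumptions) sets up a small matrix sensing instance with symmetric Gaussian measurements and runs plain GD on both the symmetric $XX^\top$ and the asymmetric $FG^\top$ parameterizations of the same PSD target. Varying the over-parameterized rank $k$, the initialization scale $\alpha$, and the iteration count $T$ lets one probe the slow over-parameterized symmetric regime versus the $\alpha$-dependent rate discussed above.

```python
# Minimal sketch (not the authors' implementation) of over-parameterized matrix
# sensing with gradient descent. All hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, r, k, m = 20, 2, 5, 600            # ambient dim, true rank, over-param rank k > r, #measurements

# Rank-r PSD ground truth M*, scaled so its top eigenvalue is O(1).
U = rng.standard_normal((n, r))
M_star = U @ U.T / n

# Near-isotropic linear measurements y_i = <A_i, M*> with symmetric Gaussian A_i.
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('mij,ij->m', A, M_star)

def grad_M(M):
    """Gradient of (1/2m) * sum_i (<A_i, M> - y_i)^2 with respect to M."""
    resid = np.einsum('mij,ij->m', A, M) - y
    return np.einsum('m,mij->ij', resid, A) / m

alpha, eta, T = 1e-3, 0.1, 3000       # init scale, step size, iterations (illustrative)

# Symmetric parameterization M = X X^T and asymmetric M = F G^T, both over-parameterized (k > r).
X = alpha * rng.standard_normal((n, k))
F = alpha * rng.standard_normal((n, k))
G = alpha * rng.standard_normal((n, k))

for t in range(T):
    GX = grad_M(X @ X.T)
    X = X - 2 * eta * GX @ X          # chain rule: d/dX = (grad_M + grad_M^T) X = 2 * grad_M X
    GFG = grad_M(F @ G.T)
    F, G = F - eta * GFG @ G, G - eta * GFG.T @ F   # simultaneous update of both factors

rel_err = lambda M: np.linalg.norm(M - M_star) / np.linalg.norm(M_star)
print(f"symmetric  XX^T relative error: {rel_err(X @ X.T):.3e}")
print(f"asymmetric FG^T relative error: {rel_err(F @ G.T):.3e}")
```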

View on arXiv: 2310.01769