On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

Abstract

We consider the linear model $\mathbf{y} = \mathbf{X}\mathbf{\beta}_\star + \mathbf{\epsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$. We estimate $\mathbf{\beta}_\star$ via generalized (weighted) ridge regression: $\hat{\mathbf{\beta}}_\lambda = \left(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w\right)^\dagger \mathbf{X}^T\mathbf{y}$, where $\mathbf{\Sigma}_w$ is the weighting matrix. Under a random design setting with general data covariance $\mathbf{\Sigma}_x$ and anisotropic prior on the true coefficients $\mathbb{E}\mathbf{\beta}_\star\mathbf{\beta}_\star^T = \mathbf{\Sigma}_\beta$, we provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^T\hat{\mathbf{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. Our general setup leads to a number of interesting findings. We outline precise conditions that decide the sign of the optimal setting $\lambda_{\rm opt}$ for the ridge parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of overparameterization, which theoretically justifies the surprising empirical observation that $\lambda_{\rm opt}$ can be negative in the overparameterized regime. We also characterize the double descent phenomenon for principal component regression (PCR) when both $\mathbf{X}$ and $\mathbf{\beta}_\star$ are anisotropic. Finally, we determine the optimal weighting matrix $\mathbf{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally regularized ($\lambda = \lambda_{\rm opt}$) case, and demonstrate the advantage of the weighted objective over standard ridge regression and PCR.
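As a rough illustration of the estimator defined in the abstract, the following NumPy sketch evaluates $\hat{\mathbf{\beta}}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w)^\dagger \mathbf{X}^T\mathbf{y}$ on synthetic data. The function name `weighted_ridge`, the dimensions, the noise level, the isotropic Gaussian design, and the identity weighting are all assumptions chosen for the example; none of them come from the paper.

```python
import numpy as np

def weighted_ridge(X, y, lam, Sigma_w):
    """Generalized ridge estimate (X^T X + lam * Sigma_w)^+ X^T y.

    Hypothetical helper for illustration; uses the Moore-Penrose
    pseudoinverse, matching the dagger in the abstract's formula.
    """
    return np.linalg.pinv(X.T @ X + lam * Sigma_w) @ (X.T @ y)

# Synthetic overparameterized problem (p > n); all values are assumed.
rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))                   # isotropic design, for simplicity
beta_star = rng.standard_normal(p) / np.sqrt(p)   # draw of the true coefficients
y = X @ beta_star + 0.1 * rng.standard_normal(n)  # noisy responses

# Identity weighting recovers standard ridge regression.
Sigma_w = np.eye(p)
beta_hat = weighted_ridge(X, y, lam=1.0, Sigma_w=Sigma_w)
```

Passing a non-identity positive definite `Sigma_w` gives the weighted variant. Choosing it optimally is the subject of the paper and is not attempted here.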
