On the generalization error of norm penalty linear regression models
- OOD
We study linear regression problems of the form $\inf_\beta \big(\mathbb{E}_{\hat{\mathbb{P}}_n}[\,|Y-\langle \beta, X\rangle|^r\,]\big)^{1/r} + \delta\,\|\beta\|$ with $r \ge 1$, a convex penalty $\|\cdot\|$, and the empirical measure $\hat{\mathbb{P}}_n$ of the data. Well-known examples include the square-root lasso, square-root sorted-$\ell_1$ penalization, and penalized least absolute deviations regression. We show that, under benign regularity assumptions on $\|\cdot\|$, such procedures naturally provide robust generalization, as the problem can be reformulated as a distributionally robust optimization (DRO) problem over a type of max-sliced Wasserstein ball $B_\delta(\hat{\mathbb{P}}_n)$, i.e. $\hat\beta$ solves the linear regression problem iff it solves $\inf_\beta \sup_{\mathbb{Q}\in B_\delta(\hat{\mathbb{P}}_n)} \big(\mathbb{E}_{\mathbb{Q}}[\,|Y-\langle \beta, X\rangle|^r\,]\big)^{1/r}$. Our proof of this result is constructive: it identifies the worst-case measure in the DRO problem, which is given by an additive perturbation of $\hat{\mathbb{P}}_n$. We argue that the balls $B_\delta(\hat{\mathbb{P}}_n)$ are the natural ones to consider in this framework, as they yield a computationally efficient procedure, comparable in cost to non-robust methods, together with optimal robustness guarantees. In fact, our generalization bounds are of order $\sqrt{d/n}$, up to logarithmic factors, and thus do not suffer from the curse of dimensionality, as is the case for known generalization bounds based on the Wasserstein metric on $\mathbb{R}^{d+1}$. Moreover, the bounds provide theoretical support for recommending a regularization parameter $\delta$ of the same order for the linear regression problem.
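As a concrete illustration of the penalized problem above, here is a minimal sketch of its square-root lasso instance ($r = 2$ with the $\ell_1$ penalty), solved with cvxpy; the synthetic data, variable names, and the choice $\delta = \sqrt{d/n}$ (the order suggested by the bounds) are illustrative assumptions, not code from the paper.

```python
# Square-root lasso sketch: the r = 2, l1-penalty instance of the
# penalized regression problem above. Synthetic data and the choice
# delta = sqrt(d/n) are illustrative assumptions, not from the paper.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                       # sample size and covariate dimension
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:5] = 1.0                  # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

delta = np.sqrt(d / n)               # regularization of the order recommended above

beta = cp.Variable(d)
# Objective: (E_{P_n}|Y - <beta, X>|^2)^{1/2} + delta * ||beta||_1
objective = cp.norm2(y - X @ beta) / np.sqrt(n) + delta * cp.norm1(beta)
problem = cp.Problem(cp.Minimize(objective))
problem.solve()                      # a convex SOCP, comparable in cost to a non-robust fit
print("estimated support:", np.flatnonzero(np.abs(beta.value) > 1e-3))
```

By the DRO equivalence stated above, the resulting $\hat\beta$ is simultaneously the solution of the worst-case problem over the max-sliced Wasserstein ball of radius $\delta$ around $\hat{\mathbb{P}}_n$.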