
On the generalization error of norm penalty linear regression models

Abstract

We study linear regression problems $\inf_{\boldsymbol{\beta} \in \mathbb{R}^d} \left(\mathbb{E}_{\mathbb{P}_n}[|Y - \mathbf{X}^{\top} \boldsymbol{\beta}|^r]\right)^{1/r} + \delta\rho(\boldsymbol{\beta})$, with $r \ge 1$, convex penalty $\rho$, and empirical measure of the data $\mathbb{P}_n$. Well-known examples include the square-root lasso, square-root sorted $\ell_1$ penalization, and penalized least absolute deviations regression. We show that, under benign regularity assumptions on $\rho$, such procedures naturally provide robust generalization, as the problem can be reformulated as a distributionally robust optimization (DRO) problem for a type of max-sliced Wasserstein ball $B_\delta^\rho(\mathbb{P}_n)$, i.e. $\widehat{\boldsymbol\beta}$ solves the linear regression problem if and only if it solves $\inf_{\boldsymbol{\beta} \in \mathbb{R}^d} \sup_{\mathbb{Q}\in B^\rho_\delta(\mathbb{P}_n)} \mathbb{E}_{\mathbb{Q}}[|Y - \mathbf{X}^{\top} \boldsymbol{\beta}|^r]$. Our proof of this result is constructive: it identifies the worst-case measure in the DRO problem, which is given by an additive perturbation of $\mathbb{P}_n$. We argue that the balls $B_\delta^\rho(\mathbb{P}_n)$ are the natural ones to consider in this framework, as they yield a procedure whose computational cost is comparable to that of non-robust methods together with optimal robustness guarantees. In fact, our generalization bounds are of order $d/n$, up to logarithmic factors, and thus do not suffer from the curse of dimensionality, as is the case for known generalization bounds based on the Wasserstein metric on $\mathbb{R}^d$. Moreover, the bounds provide theoretical support for recommending a regularization parameter $\delta$ of the same order for the linear regression problem.
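A minimal numerical sketch of the penalized problem above, for the square-root lasso case $r = 2$ and $\rho(\boldsymbol\beta) = \|\boldsymbol\beta\|_1$, can be written with an off-the-shelf convex solver. This is illustrative only and not taken from the paper; the synthetic data, the dimensions, and the particular value of $\delta$ below are hypothetical choices.

# Sketch: the square-root lasso as an instance of
#   inf_beta (E_Pn[|Y - X^T beta|^r])^(1/r) + delta * rho(beta)
# with r = 2 and rho(beta) = ||beta||_1 (illustrative, not from the paper).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                       # sample size and dimension (hypothetical)
X = rng.standard_normal((n, d))      # design matrix
beta_true = np.concatenate([np.ones(3), np.zeros(d - 3)])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

delta = d / n                        # regularization of the order suggested by the bounds
                                     # (up to constants and logarithmic factors)

beta = cp.Variable(d)
# (E_Pn[|Y - X^T beta|^2])^(1/2) equals ||Y - X beta||_2 / sqrt(n)
empirical_loss = cp.norm(Y - X @ beta, 2) / np.sqrt(n)
penalty = delta * cp.norm(beta, 1)   # rho(beta) = ||beta||_1

problem = cp.Problem(cp.Minimize(empirical_loss + penalty))
problem.solve()
print("estimated beta:", np.round(beta.value, 3))

By the equivalence stated in the abstract, the same $\widehat{\boldsymbol\beta}$ is also the minimax solution of the DRO problem over the max-sliced Wasserstein ball $B_\delta^\rho(\mathbb{P}_n)$.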
