Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization

Abstract

Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications on robustness, bias and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive core feature $x$ and a spurious feature $y$. Specifically, we quantify the amount of spurious correlations $C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $C$ and the in-distribution test loss $L$, by showing that the value of $\lambda$ that minimizes $L$ lies in an interval where $C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
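As a rough illustration of the setting described in the abstract, the following NumPy sketch fits ridge regression on synthetic Gaussian data and estimates how much the learned model relies on the spurious feature. Everything in it is an assumption made for illustration and is not taken from the paper: the data model (the spurious feature is a noisy copy of the core feature with correlation `rho`), the function names `ridge_fit` and `spurious_proxy`, the default sizes, and in particular the proxy used for $C$, which here is the covariance between the label and the model's output when the core feature is resampled independently of the spurious one. The paper's formal definition of $C$ may differ.

```python
import numpy as np


def ridge_fit(Z, g, lam):
    """Closed-form ridge solution (Z^T Z + n*lam*I)^{-1} Z^T g."""
    n, d = Z.shape
    return np.linalg.solve(Z.T @ Z + n * lam * np.eye(d), Z.T @ g)


def spurious_proxy(lam, n=1000, dim=100, rho=0.8, noise=0.5,
                   n_test=5000, seed=0):
    """Return (C, L): a proxy for spurious correlation and the test loss.

    Hypothetical setup: labels depend only on the core feature x, while
    the spurious feature y is correlated with x coordinate-wise.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(dim) / np.sqrt(dim)

    def sample(m):
        x = rng.standard_normal((m, dim))
        y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal((m, dim))
        return x, y

    # Training set: noisy labels generated from the core feature only.
    x_tr, y_tr = sample(n)
    g_tr = x_tr @ theta_star + noise * rng.standard_normal(n)
    theta = ridge_fit(np.hstack([x_tr, y_tr]), g_tr, lam)

    # In-distribution test set with noiseless labels.
    x_te, y_te = sample(n_test)
    g_te = x_te @ theta_star
    pred = np.hstack([x_te, y_te]) @ theta
    L = float(np.mean((pred - g_te) ** 2))

    # Proxy for C: resample the core feature independently of y, so any
    # remaining covariance with the label must flow through y.
    x_indep = rng.standard_normal((n_test, dim))
    pred_spur = np.hstack([x_indep, y_te]) @ theta
    C = float(np.cov(pred_spur, g_te)[0, 1])
    return C, L


if __name__ == "__main__":
    for lam in (1e-6, 1e-2, 1e-1, 1.0):
        C, L = spurious_proxy(lam)
        print(f"lambda={lam:g}  C={C:.3f}  L={L:.3f}")
```

Sweeping `lam` over a finer grid and plotting both returned values against it is one way to look for the kind of trade-off between $C$ and $L$ that the abstract describes, within the limits of this toy data model.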

@article{bombari2025_2502.01347,
  title={Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization},
  author={Simone Bombari and Marco Mondelli},
  journal={arXiv preprint arXiv:2502.01347},
  year={2025}
}