Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization

Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications for robustness, bias and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive core feature and a spurious feature. Specifically, we quantify the amount of spurious correlations learned via linear regression, in terms of the data covariance and the strength of the ridge regularization. As a consequence, we first capture the simplicity of the spurious feature through the spectrum of its covariance, and its correlation with the core feature through the Schur complement of the full data covariance. Next, we prove a trade-off between the amount of spurious correlations learned and the in-distribution test loss, by showing that the regularization strength that minimizes the test loss lies in an interval where the amount of spurious correlations is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
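As a rough illustration of how ridge regularization can make a linear model put weight on a spurious feature, the following is a minimal sketch on a two-dimensional Gaussian toy problem. Everything in it is an assumption made for illustration and not the construction analyzed in the paper: the function names `ridge_weights` and `spurious_weight`, the correlation `rho`, the noise level, the sample size, and the use of the fitted weight on the spurious coordinate as a simple proxy for the amount of spurious correlation learned.

```python
import numpy as np


def ridge_weights(features, labels, lam):
    """Ridge solution w = (X^T X / n + lam * I)^{-1} X^T g / n (illustrative scaling)."""
    n, d = features.shape
    gram = features.T @ features / n
    return np.linalg.solve(gram + lam * np.eye(d), features.T @ labels / n)


def spurious_weight(lam, rho=0.8, n=5000, seed=0):
    """Weight that ridge assigns to a spurious coordinate in a toy Gaussian model.

    Hypothetical setup: the label depends only on the core coordinate, while the
    spurious coordinate is correlated with the core one (correlation rho).
    """
    rng = np.random.default_rng(seed)
    core = rng.standard_normal(n)
    spurious = rho * core + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    labels = core + 0.1 * rng.standard_normal(n)  # no direct dependence on `spurious`
    features = np.column_stack([core, spurious])
    return ridge_weights(features, labels, lam)[1]


if __name__ == "__main__":
    # Print the proxy for a few regularization strengths.
    for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
        print(f"lambda={lam:>6}: weight on spurious coordinate = {spurious_weight(lam):.4f}")
```

In this toy setting the unregularized fit assigns almost no weight to the spurious coordinate, while moderate regularization shifts weight onto it; for very large regularization all weights shrink toward zero. The sketch only illustrates the qualitative effect and says nothing about the high-dimensional regime or the test-loss trade-off studied in the paper.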
@article{bombari2025_2502.01347,
  title={Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization},
  author={Simone Bombari and Marco Mondelli},
  journal={arXiv preprint arXiv:2502.01347},
  year={2025}
}