Observation Noise and Initialization in Wide Neural Networks

Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific choice of prior mean and with zero observation noise. However, existing formulations of this result have two limitations: i) the resultant NTK-GP assumes no noise in the observed target variables, which can result in suboptimal predictions with noisy data; ii) it is unclear how to extend the equivalence to an arbitrary prior mean, a crucial aspect of formulating a well-specified model. To address the first limitation, we introduce a regularizer into the neural network's training objective, formally showing its correspondence to incorporating observation noise into the NTK-GP model. To address the second, we introduce a shifted network that enables arbitrary prior mean functions. This approach allows us to perform gradient descent on a single neural network, without expensive ensembling or kernel matrix inversion. Our theoretical insights are validated empirically, with experiments exploring different values of observation noise and network architectures.
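
As a rough illustration of the two ingredients described above, the sketch below trains a single wide network on noisy 1-D regression data with (i) an L2 penalty on the displacement of the weights from their initial values, scaled by an assumed observation-noise variance, and (ii) a shifted parameterization net(x) - net0(x) + m(x) whose output at initialization equals a chosen prior mean m. This is not the paper's reference implementation: the network width, the prior mean, the noise level, and the exact form and scaling of the regularizer are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted in comments), not the paper's reference code.
import copy
import torch
import torch.nn as nn


def make_mlp(width: int = 2048) -> nn.Sequential:
    # A wide two-layer MLP; the width is large so NTK-regime intuition roughly applies.
    return nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))


def prior_mean(x: torch.Tensor) -> torch.Tensor:
    # Illustrative prior mean m(x); any fixed function could be substituted here.
    return torch.zeros_like(x)


net = make_mlp()
net0 = copy.deepcopy(net)              # frozen copy of the network at initialization
for p in net0.parameters():
    p.requires_grad_(False)


def shifted_forward(x: torch.Tensor) -> torch.Tensor:
    # Shifted network: subtract the frozen initial network and add the prior mean,
    # so the model's output at step 0 is exactly m(x).
    return net(x) - net0(x) + prior_mean(x)


# Toy noisy regression data.
torch.manual_seed(0)
x_train = torch.linspace(-1.0, 1.0, 32).unsqueeze(-1)
y_train = torch.sin(3.0 * x_train) + 0.1 * torch.randn_like(x_train)

sigma2 = 0.1 ** 2                      # assumed observation-noise variance
n = x_train.shape[0]
opt = torch.optim.SGD(net.parameters(), lr=1e-3)

for step in range(5000):
    opt.zero_grad()
    pred = shifted_forward(x_train)
    # Ridge-style objective: squared error plus sigma^2 times the squared
    # displacement of the weights from their initial values. Dividing the whole
    # objective by n leaves the minimizer unchanged and keeps step sizes tame.
    mse = ((pred - y_train) ** 2).mean()
    reg = sum(((p - p0) ** 2).sum()
              for p, p0 in zip(net.parameters(), net0.parameters()))
    loss = mse + (sigma2 / n) * reg
    loss.backward()
    opt.step()

with torch.no_grad():
    x_test = torch.linspace(-1.5, 1.5, 5).unsqueeze(-1)
    print(shifted_forward(x_test).squeeze(-1))   # approximate noisy NTK-GP posterior mean
```

Under the stated assumptions, penalizing the parameter displacement from initialization plays the role of the observation-noise term in kernel ridge regression, while the shifted parameterization fixes the function at initialization to the chosen prior mean; only a single network is trained, with no ensembling or kernel matrix inversion.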
@article{calvo-ordoñez2025_2502.01556,
  title   = {Observation Noise and Initialization in Wide Neural Networks},
  author  = {Sergio Calvo-Ordoñez and Jonathan Plenk and Richard Bergna and Alvaro Cartea and Jose Miguel Hernandez-Lobato and Konstantina Palla and Kamil Ciosek},
  journal = {arXiv preprint arXiv:2502.01556},
  year    = {2025}
}