Recent results show that estimates defined by over-parametrized deep neural networks learned by applying gradient descent to a regularized empirical risk are universally consistent and achieve good rates of convergence. In this paper, we show that the regularization term is not necessary to obtain similar results. In the case of a suitably chosen initialization of the network, a suitable number of gradient descent steps, and a suitable step size, we show that an estimate without a regularization term is universally consistent for bounded predictor variables. Additionally, we show that if the regression function is H\"older smooth with H\"older exponent $p$, the error converges to zero with a rate of convergence of approximately $n^{-1/(1+d)}$. Furthermore, in the case of an interaction model, where the regression function consists of a sum of H\"older smooth functions with $d^*$ components, a rate of convergence is derived which does not depend on the input dimension $d$.
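To make the contrast with the earlier, regularized approach concrete, the display below sketches the two training criteria and the gradient descent iteration in generic notation; the symbols $f_{\mathbf{w}}$, $c_n$, $\lambda_n$, and $t_n$, as well as the ridge-type form of the penalty, are illustrative assumptions and are not taken from the paper itself. Prior results apply gradient descent to a penalized empirical risk, whereas the estimate studied here is obtained from the same iteration applied to the plain empirical risk.
$$
F_n(\mathbf{w}) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(f_{\mathbf{w}}(X_i)-Y_i\bigr)^2
\qquad\text{vs.}\qquad
F_n^{\mathrm{reg}}(\mathbf{w}) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(f_{\mathbf{w}}(X_i)-Y_i\bigr)^2 \;+\; c_n\,\|\mathbf{w}\|^2,
$$
$$
\mathbf{w}^{(t+1)} \;=\; \mathbf{w}^{(t)} - \lambda_n\,\nabla_{\mathbf{w}} F_n\bigl(\mathbf{w}^{(t)}\bigr), \qquad t=0,\dots,t_n-1,
$$
where $\mathbf{w}^{(0)}$ denotes the (suitably chosen) random initialization, $\lambda_n$ the step size, and $t_n$ the number of gradient descent steps; the result of the paper concerns the estimate trained on $F_n$, i.e., without the penalty term appearing in $F_n^{\mathrm{reg}}$.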