We derive finite width and depth corrections for the Neural Tangent Kernel (NTK) of ResNets and DenseNets. Our analysis reveals that finite size residual architectures are initialized much closer to the "kernel regime" than their vanilla counterparts: while in networks that do not use skip connections, convergence to the NTK requires one to fix the depth, while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided with a proper initialization. In DenseNets, however, convergence of the NTK to its limit as the width tends to infinity is guaranteed, at a rate that is independent of both the depth and scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.
View on arXiv