Training Overparametrized Neural Networks in Sublinear Time

Abstract

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of AI. Despite the popularity and low cost-per-iteration of traditional backpropagation via gradient descent, SGD has a prohibitively slow convergence rate in non-convex settings, both in theory and in practice. To mitigate this cost, recent works have proposed alternative (Newton-type) training methods with a much faster convergence rate, albeit with a higher cost-per-iteration. For a typical neural network with $m = \mathrm{poly}(n)$ parameters and an input batch of $n$ datapoints in $\mathbb{R}^d$, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires $\sim mnd + n^3$ time per iteration. In this paper, we present a novel training method that requires only $m^{1-\alpha} n d + n^3$ amortized time per iteration in the same overparametrized regime, where $\alpha \in (0.01, 1)$ is some fixed constant. This method relies on a new, alternative view of neural networks as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view will have further applications in the design and analysis of DNNs.
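To get a feel for the gap between the two per-iteration bounds, the following minimal sketch evaluates both expressions numerically. It is only an illustration: the function names are hypothetical, all hidden constants are dropped, and the values chosen for $n$, $d$, $m$, and $\alpha$ are assumptions made for the arithmetic, not figures from the paper.

```python
# Illustrative arithmetic only: constants hidden by the "~" notation are ignored,
# and every concrete value below is an assumption, not taken from the paper.

def dense_cost(m: int, n: int, d: int) -> float:
    """Per-iteration bound of the earlier Newton-type method: m*n*d + n^3."""
    return m * n * d + n ** 3


def sublinear_cost(m: int, n: int, d: int, alpha: float) -> float:
    """Amortized per-iteration bound stated in the abstract: m^(1-alpha)*n*d + n^3."""
    return m ** (1 - alpha) * n * d + n ** 3


if __name__ == "__main__":
    n, d = 100, 10   # assumed batch size and input dimension
    m = n ** 4       # assumed instance of m = poly(n)
    alpha = 0.5      # assumed value inside the stated range (0.01, 1)

    old = dense_cost(m, n, d)
    new = sublinear_cost(m, n, d, alpha)
    print(f"m*n*d + n^3           = {old:.3e}")
    print(f"m^(1-alpha)*n*d + n^3 = {new:.3e}")
    print(f"ratio                 = {old / new:.1f}x")
```

With these assumed values the first term dominates the older bound, while in the new bound the $n^3$ term becomes a comparable share of the total. This shows why the improvement matters most when $m$ is much larger than $n$.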
