On the Convergence Rate of Training Recurrent Neural Networks

Abstract

Despite the huge success of deep learning, our understanding of how non-convex neural networks are trained remains rather limited. Most existing theoretical works tackle only neural networks with one hidden layer, and little is known about multi-layer neural networks. Recurrent neural networks (RNNs) are special multi-layer networks extensively used in natural language processing applications. They are particularly hard to analyze, compared to feedforward networks, because the weight parameters are reused across the entire time horizon. We provide arguably the first theoretical understanding of the convergence speed of training RNNs (when activation functions are present). Specifically, when the weights are randomly initialized and the number of neurons is sufficiently large --- meaning polynomial in the training data size and the time horizon --- we show that gradient descent and stochastic gradient descent both minimize the regression loss at a linear convergence rate, that is, $\varepsilon \propto e^{-\Omega(T)}$.
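To make the training setup concrete, here is a minimal, purely illustrative sketch of the regime the abstract describes: an Elman-style RNN with tanh activation and shared recurrent weights, randomly initialized, trained by gradient descent (via backpropagation through time) on a squared regression loss over the final hidden state. The dimensions, step size, and synthetic data below are arbitrary choices for illustration, not the paper's setting; the theoretical result concerns much wider (polynomially overparameterized) networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T_steps = 64, 8, 5           # hidden width, input dim, time horizon (toy sizes)
N = 20                             # number of training sequences

# Random Gaussian initialization, scaled by 1/sqrt(fan-in).
W = rng.normal(0, 1 / np.sqrt(n), (n, n))   # recurrent weights, reused at every step
A = rng.normal(0, 1 / np.sqrt(d), (n, d))   # input weights
B = rng.normal(0, 1 / np.sqrt(n), (1, n))   # output weights (held fixed here)

X = rng.normal(size=(N, T_steps, d))        # synthetic training sequences
y = rng.normal(size=(N,))                   # synthetic regression targets

def forward(W):
    """Unroll h_{t+1} = tanh(W h_t + A x_t) over all sequences."""
    H = np.zeros((N, T_steps + 1, n))
    for t in range(T_steps):
        H[:, t + 1] = np.tanh(H[:, t] @ W.T + X[:, t] @ A.T)
    preds = H[:, -1] @ B.ravel()            # scalar prediction from the last state
    return preds, H

def loss_and_grad(W):
    """Squared regression loss and its gradient w.r.t. the shared matrix W,
    computed by backpropagation through time."""
    preds, H = forward(W)
    err = preds - y
    loss = 0.5 * np.mean(err ** 2)
    gW = np.zeros_like(W)
    # delta holds dL/d(pre-activation) at the current time step.
    delta = (err[:, None] * B / N) * (1 - H[:, -1] ** 2)
    for t in range(T_steps - 1, -1, -1):
        gW += delta.T @ H[:, t]             # W is shared, so gradients accumulate
        if t > 0:
            delta = (delta @ W) * (1 - H[:, t] ** 2)
    return loss, gW

losses = []
eta = 1.0                                   # step size, chosen ad hoc
for step in range(200):
    loss, gW = loss_and_grad(W)
    losses.append(loss)
    W -= eta * gW

print(f"initial loss {losses[0]:.4f} -> final loss {losses[-1]:.4f}")
```

At these toy sizes the loss decreases steadily but of course does not certify the paper's linear rate; the theorem's guarantee kicks in only in the heavily overparameterized regime it analyzes.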
