On the Provable Generalization of Recurrent Neural Networks

29 September 2021
Lifu Wang
Bo Shen
Bo Hu
Xing Cao
arXiv:2109.14142 (PDF, HTML)
Abstract

Recurrent Neural Networks (RNNs) are a fundamental structure in deep learning. Recently, several works have studied the training process of over-parameterized neural networks and shown that over-parameterized networks can learn functions in certain notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization of RNNs with random initialization and provide the following improvements over recent works: 1) For an RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works study learning functions that are summations of $f(\beta_l^T X_l)$ and require the normalization condition $\|X_l\|\leq\epsilon$ for some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without normalization conditions, and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. 2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the "additive" concept class, i.e., summations of functions $f(X_l)$. We show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ is learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$.
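
For readers who prefer code, the short NumPy sketch below writes out the two target-function classes discussed in the abstract: the "additive" class $\sum_l f(\beta_l^T X_l)$ and the $N$-variable class $f(\beta^T[X_{l_1},...,X_{l_N}])$, whose learnability point 2) relates to the size of $N$ or $l_0$. All names and concrete choices (f = tanh, the selected positions, etc.) are illustrative assumptions, not code from the paper.

# Minimal sketch of the two target-function classes from the abstract.
# Variable names and concrete choices here are illustrative, not the paper's.
import numpy as np

L, d = 10, 8                       # sequence length and input dimension
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))        # input sequence x = (X_1, ..., X_L)
f = np.tanh                        # a smooth one-dimensional base function

# Class 1 ("additive"): sum_l f(beta_l^T X_l), one weight vector per position.
betas = rng.normal(size=(L, d))
additive_target = sum(f(betas[l] @ X[l]) for l in range(L))

# Class 2 (N-variable): f(beta^T [X_{l_1}, ..., X_{l_N}]), applied to the
# concatenation of N selected positions; here N = 3 and l_0 = max - min = 3.
idx = [2, 3, 5]
beta = rng.normal(size=(len(idx) * d,))
nvar_target = f(beta @ np.concatenate([X[l] for l in idx]))

print(additive_target, nvar_target)

The second construction is the non-additive case: it cannot in general be rewritten as a sum of per-position terms, which is why the paper treats it as a separate concept class.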

View on arXiv