Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives

6 December 2023
Pierre Wolinski
Abstract

We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$. This framework encompasses many common use cases, such as training neural networks by gradient descent. First, we propose a computationally inexpensive technique providing higher-order information on $\mathcal{L}$, especially about the interactions between the tensors $\mathbf{T}_s$, based on automatic differentiation and computational tricks. Second, we use this technique at order 2 to build a second-order optimization method which is suitable, among other things, for training deep neural networks of various architectures. This second-order method leverages the partition structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$, in such a way that it requires neither the computation of the Hessian of $\mathcal{L}$ according to $\boldsymbol{\theta}$, nor any approximation of it. The key part consists in computing a smaller matrix interpretable as a "Hessian according to the partition", which can be computed exactly and efficiently. In contrast to many existing practical second-order methods used in neural networks, which perform a diagonal or block-diagonal approximation of the Hessian or its inverse, the method we propose does not neglect interactions between layers. Finally, we can tune the coarseness of the partition to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent method, while the finest case corresponds to the usual Newton's method.
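To make the order-2 idea concrete, here is a minimal PyTorch sketch, not the paper's reference implementation. It assumes the per-tensor directions are simply the gradients $\mathbf{g}_s$, builds an $S \times S$ matrix $\bar{H}$ with entries $\mathbf{u}_s^\top H \mathbf{u}_t$ from $S$ Hessian-vector products obtained by automatic differentiation, and solves the small system $\bar{H} \boldsymbol{\eta} = \bar{\mathbf{g}}$ for per-tensor step sizes. The function name `partition_newton_step` and the `damping` ridge term are illustrative choices of this sketch, not from the paper.

```python
import torch


def partition_newton_step(loss, tensors, damping=1e-5):
    """Sketch: per-tensor step sizes from an S x S summary of the Hessian.

    loss    : scalar torch.Tensor, value of L(theta), graph still available
    tensors : list of parameter tensors (T_1, ..., T_S) with requires_grad=True
    damping : illustrative ridge term added to the S x S matrix for stability
    """
    S = len(tensors)
    device, dtype = loss.device, loss.dtype

    # Per-tensor gradients g_s, kept differentiable for a second backward pass.
    grads = torch.autograd.grad(loss, tensors, create_graph=True)
    dirs = [g.detach() for g in grads]            # directions u_s := g_s

    # g_bar[s] = <g_s, u_s>
    g_bar = torch.stack([(g.detach() * u).sum() for g, u in zip(grads, dirs)])

    # H_bar[s, t] = u_s^T H u_t, one Hessian-vector product per column:
    # differentiate <g_t, u_t> w.r.t. all tensors to get the blocks of H u_t.
    H_bar = torch.zeros(S, S, device=device, dtype=dtype)
    for t in range(S):
        hvp = torch.autograd.grad((grads[t] * dirs[t]).sum(), tensors,
                                  retain_graph=True, allow_unused=True)
        for s in range(S):
            if hvp[s] is not None:
                H_bar[s, t] = (hvp[s] * dirs[s]).sum()

    # Per-tensor learning rates: Newton's method on the S-dimensional problem.
    eye = torch.eye(S, device=device, dtype=dtype)
    eta = torch.linalg.solve(H_bar + damping * eye, g_bar)

    # Updates to subtract from each T_s.
    return [eta[s] * dirs[s] for s in range(S)]
```

Under these assumptions the two limiting cases of the abstract are visible: with $S = 1$ (all parameters in one tensor) the solve reduces to the scalar step $\mathbf{g}^\top \mathbf{g} / \mathbf{g}^\top H \mathbf{g}$ of Cauchy's steepest descent, while the finest partition (one tensor per coordinate) recovers the usual Newton step.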
