On the Optimization and Generalization of Multi-head Attention

19 October 2023 · arXiv: 2310.12680
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
MLT

Papers citing "On the Optimization and Generalization of Multi-head Attention"

7 / 7 papers shown

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang · 02 May 2025

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang · 17 Oct 2024

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis · 08 Feb 2024

Restricted Strong Convexity of Deep Learning Models with Smooth Activations
A. Banerjee, Pedro Cisneros-Velarde, Libin Zhu, M. Belkin · 29 Sep 2022

Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks
Yunwen Lei, Rong Jin, Yiming Ying · MLT · 19 Sep 2022

A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
Mo Zhou, Rong Ge, Chi Jin · 04 Feb 2021

A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit · 06 Jun 2016