On the Optimization and Generalization of Multi-head Attention

19 October 2023 · arXiv: 2310.12680
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis
MLT

Papers citing "On the Optimization and Generalization of Multi-head Attention"

7 / 7 papers shown

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang · 02 May 2025

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang · 17 Oct 2024

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis · 08 Feb 2024

Restricted Strong Convexity of Deep Learning Models with Smooth Activations
A. Banerjee, Pedro Cisneros-Velarde, Libin Zhu, M. Belkin · 29 Sep 2022

Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks
Yunwen Lei, Rong Jin, Yiming Ying · MLT · 19 Sep 2022

A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network
Mo Zhou, Rong Ge, Chi Jin · 04 Feb 2021

A Decomposable Attention Model for Natural Language Inference
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit · 06 Jun 2016