arXiv:2402.05738
Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
8 February 2024
Papers citing "Implicit Bias and Fast Convergence Rates for Self-attention" (50 of 85 papers shown)
Transformers are almost optimal metalearners for linear classification
Roey Magen, Gal Vardi. 22 Oct 2025. (100 / 0 / 0)
Learning Linear Regression with Low-Rank Tasks in-Context
Kaito Takanami, Takashi Takahashi, Y. Kabashima. 06 Oct 2025. (59 / 0 / 0)
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei. LRM. 12 Jun 2025. (437 / 4 / 0)
Transformative or Conservative? Conservation laws for ResNets and Transformers
Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré. 06 Jun 2025. (196 / 3 / 0)
The Rich and the Simple: On the Implicit Bias of Adam and SGD
Bhavya Vasudeva, Jung Whan Lee, Willie Neiswanger, Mahdi Soltanolkotabi. 29 May 2025. (155 / 4 / 0)
Variational Deep Learning via Implicit Regularization
Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham. OOD, UQCV, BDL. 26 May 2025. (256 / 1 / 0)
How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang. MLT. 21 May 2025. (286 / 1 / 0)
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang. 02 May 2025. (480 / 4 / 0)
Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu. 26 Apr 2025. (842 / 1 / 0)
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak. 06 Apr 2025. (147 / 4 / 0)
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective
Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A. Erdogdu. 14 Mar 2025. (263 / 3 / 0)
Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang, Aaditya K. Singh, Peter E. Latham, Andrew Saxe. MLT. 27 Jan 2025. (251 / 19 / 0)
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang. International Conference on Learning Representations (ICLR), 2024. 17 Oct 2024. (357 / 3 / 0)
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
Spencer Frei, Gal Vardi. International Conference on Learning Representations (ICLR), 2024. MLT. 02 Oct 2024. (223 / 9 / 0)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Ruiquan Huang, Yingbin Liang, Jing Yang. Neural Information Processing Systems (NeurIPS), 2024. 25 Sep 2024. (210 / 10 / 0)
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou. MLT. 13 Mar 2024. (185 / 13 / 0)
Mechanics of Next Token Prediction with Self-Attention
Yingcong Li, Yixiao Huang, M. E. Ildiz, A. S. Rawat, Samet Oymak. International Conference on Artificial Intelligence and Statistics (AISTATS), 2024. 12 Mar 2024. (178 / 39 / 0)
Transformers Learn Low Sensitivity Functions: Investigations and Implications
Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Willie Neiswanger. International Conference on Learning Representations (ICLR), 2024. 11 Mar 2024. (335 / 2 / 0)
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
M. E. Ildiz, Yixiao Huang, Yingcong Li, A. S. Rawat, Samet Oymak. 21 Feb 2024. (152 / 33 / 0)
Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael C. Gastpar. OffRL. 06 Feb 2024. (294 / 36 / 0)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data
Yiwen Kou, Zixiang Chen, Quanquan Gu. Neural Information Processing Systems (NeurIPS), 2023. MLT. 29 Oct 2023. (114 / 17 / 0)
On the Optimization and Generalization of Multi-head Attention
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. MLT. 19 Oct 2023. (228 / 41 / 0)
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du. International Conference on Learning Representations (ICLR), 2023. 01 Oct 2023. (246 / 45 / 0)
Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak. Neural Information Processing Systems (NeurIPS), 2023. 23 Jun 2023. (431 / 51 / 0)
Trained Transformers Learn Linear Models In-Context
Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. Journal of machine learning research (JMLR), 2023. 16 Jun 2023. (317 / 270 / 0)
On the Role of Attention in Prompt-tuning
Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis. International Conference on Machine Learning (ICML), 2023. MLT, LRM. 06 Jun 2023. (152 / 57 / 0)
Representational Strengths and Limitations of Transformers
Clayton Sanford, Daniel J. Hsu, Matus Telgarsky. Neural Information Processing Systems (NeurIPS), 2023. 05 Jun 2023. (236 / 112 / 0)
Memorization Capacity of Multi-Head Attention in Transformers
Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis. International Conference on Learning Representations (ICLR), 2023. 03 Jun 2023. (362 / 33 / 0)
Birth of a Transformer: A Memory Viewpoint
A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Edouard Grave, Léon Bottou. Neural Information Processing Systems (NeurIPS), 2023. 01 Jun 2023. (330 / 138 / 0)
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian, Yiping Wang, Beidi Chen, S. Du. Neural Information Processing Systems (NeurIPS), 2023. MLT. 25 May 2023. (364 / 96 / 0)
Fast Convergence in Learning Two-Layer Neural Networks with Separable Data
Hossein Taheri, Christos Thrampoulidis. AAAI Conference on Artificial Intelligence (AAAI), 2023. MLT. 22 May 2023. (197 / 3 / 0)
MoMo: Momentum Models for Adaptive Learning Rates
Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert Mansel Gower. International Conference on Machine Learning (ICML), 2023. 12 May 2023. (277 / 19 / 0)
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark Schmidt. International Conference on Learning Representations (ICLR), 2023. 27 Apr 2023. (248 / 100 / 0)
Benign Overfitting for Two-layer ReLU Convolutional Neural Networks
Yiwen Kou, Zi-Yuan Chen, Yuanzhou Chen, Quanquan Gu. International Conference on Machine Learning (ICML), 2023. MLT. 07 Mar 2023. (177 / 23 / 0)
Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro. Annual Conference Computational Learning Theory (COLT), 2023. 02 Mar 2023. (179 / 28 / 0)
Generalization and Stability of Interpolating Neural Networks with Minimal Width
Hossein Taheri, Christos Thrampoulidis. Journal of machine learning research (JMLR), 2023. 18 Feb 2023. (275 / 20 / 0)
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Hongkang Li, Ming Wang, Sijia Liu, Pin-Yu Chen. International Conference on Learning Representations (ICLR), 2023. ViT, MLT. 12 Feb 2023. (437 / 77 / 0)
Transformers as Algorithms: Generalization and Stability in In-context Learning
Yingcong Li, M. E. Ildiz, Dimitris Papailiopoulos, Samet Oymak. International Conference on Machine Learning (ICML), 2023. 17 Jan 2023. (243 / 217 / 0)
Transformers learn in-context by gradient descent
J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. International Conference on Machine Learning (ICML), 2022. MLT. 15 Dec 2022. (394 / 625 / 0)
What learning algorithm is in-context learning? Investigations with linear models
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou. International Conference on Learning Representations (ICLR), 2022. 28 Nov 2022. (434 / 595 / 0)
Convexifying Transformers: Improving optimization and understanding of transformer networks
Tolga Ergen, Behnam Neyshabur, Harsh Mehta. MLT. 20 Nov 2022. (191 / 15 / 0)
Vision Transformers provably learn spatial structure
Samy Jelassi, Michael E. Sander, Yuan-Fang Li. Neural Information Processing Systems (NeurIPS), 2022. ViT, MLT. 13 Oct 2022. (179 / 99 / 0)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu. International Conference on Learning Representations (ICLR), 2022. MLT. 13 Oct 2022. (191 / 48 / 0)
On the Implicit Bias in Deep-Learning Algorithms
Gal Vardi. Communications of the ACM (CACM), 2022. FedML, AI4CE. 26 Aug 2022. (288 / 107 / 0)
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022. ODL. 13 Aug 2022. (313 / 233 / 0)
Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Arda Sahiner, Tolga Ergen, Batu Mehmet Ozturkler, John M. Pauly, Morteza Mardani, Mert Pilanci. International Conference on Machine Learning (ICML), 2022. 17 May 2022. (267 / 35 / 0)
The Quarks of Attention
Pierre Baldi, Roman Vershynin. Artificial Intelligence (AIJ), 2022. GNN. 15 Feb 2022. (88 / 11 / 0)
Benign Overfitting in Two-layer Convolutional Neural Networks
Yuan Cao, Zixiang Chen, M. Belkin, Quanquan Gu. Neural Information Processing Systems (NeurIPS), 2022. MLT. 14 Feb 2022. (313 / 105 / 0)
Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data
Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett. Annual Conference Computational Learning Theory (COLT), 2022. MLT. 11 Feb 2022. (396 / 86 / 0)
Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, B. Song. ViT. 27 Dec 2021. (185 / 276 / 0)