arXiv:2402.05738
Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
8 February 2024
Papers citing "Implicit Bias and Fast Convergence Rates for Self-attention" (50 of 85 papers shown)
Transformers are almost optimal metalearners for linear classification
Roey Magen, Gal Vardi. 22 Oct 2025. (100 / 0 / 0)
Learning Linear Regression with Low-Rank Tasks in-Context
Kaito Takanami, Takashi Takahashi, Y. Kabashima. 06 Oct 2025. (59 / 0 / 0)
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei. LRM. 12 Jun 2025. (437 / 4 / 0)
Transformative or Conservative? Conservation laws for ResNets and Transformers
Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré. 06 Jun 2025. (196 / 3 / 0)
The Rich and the Simple: On the Implicit Bias of Adam and SGD
Bhavya Vasudeva, Jung Whan Lee, Willie Neiswanger, Mahdi Soltanolkotabi. 29 May 2025. (155 / 4 / 0)
Variational Deep Learning via Implicit Regularization
Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham. OOD, UQCV, BDL. 26 May 2025. (256 / 1 / 0)
How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang. MLT. 21 May 2025. (286 / 1 / 0)
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang. 02 May 2025. (480 / 4 / 0)
Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu. 26 Apr 2025. (842 / 1 / 0)
Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak. 06 Apr 2025. (147 / 4 / 0)
When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective
Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A. Erdogdu. 14 Mar 2025. (263 / 3 / 0)
Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang, Aaditya K. Singh, Peter E. Latham, Andrew Saxe. MLT. 27 Jan 2025. (251 / 19 / 0)
On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang. International Conference on Learning Representations (ICLR), 2024. 17 Oct 2024. (357 / 3 / 0)
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
Spencer Frei, Gal Vardi. International Conference on Learning Representations (ICLR), 2024. MLT. 02 Oct 2024. (223 / 9 / 0)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Ruiquan Huang, Yingbin Liang, Jing Yang. Neural Information Processing Systems (NeurIPS), 2024. 25 Sep 2024. (210 / 10 / 0)
Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou. MLT. 13 Mar 2024. (185 / 13 / 0)
Mechanics of Next Token Prediction with Self-Attention
Yingcong Li, Yixiao Huang, M. E. Ildiz, A. S. Rawat, Samet Oymak. International Conference on Artificial Intelligence and Statistics (AISTATS), 2024. 12 Mar 2024. (178 / 39 / 0)
Transformers Learn Low Sensitivity Functions: Investigations and Implications
Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Willie Neiswanger. International Conference on Learning Representations (ICLR), 2024. 11 Mar 2024. (335 / 2 / 0)
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
M. E. Ildiz, Yixiao Huang, Yingcong Li, A. S. Rawat, Samet Oymak. 21 Feb 2024. (152 / 33 / 0)
Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael C. Gastpar. OffRL. 06 Feb 2024. (294 / 36 / 0)
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data
Yiwen Kou, Zixiang Chen, Quanquan Gu. Neural Information Processing Systems (NeurIPS), 2023. MLT. 29 Oct 2023. (114 / 17 / 0)
On the Optimization and Generalization of Multi-head Attention
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. MLT. 19 Oct 2023. (228 / 41 / 0)
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du. International Conference on Learning Representations (ICLR), 2023. 01 Oct 2023. (246 / 45 / 0)
Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak. Neural Information Processing Systems (NeurIPS), 2023. 23 Jun 2023. (431 / 51 / 0)
Trained Transformers Learn Linear Models In-Context
Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. Journal of machine learning research (JMLR), 2023. 16 Jun 2023. (317 / 270 / 0)
On the Role of Attention in Prompt-tuning
Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis. International Conference on Machine Learning (ICML), 2023. MLT, LRM. 06 Jun 2023. (152 / 57 / 0)
Representational Strengths and Limitations of Transformers
Clayton Sanford, Daniel J. Hsu, Matus Telgarsky. Neural Information Processing Systems (NeurIPS), 2023. 05 Jun 2023. (236 / 112 / 0)
Memorization Capacity of Multi-Head Attention in Transformers
Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis. International Conference on Learning Representations (ICLR), 2023. 03 Jun 2023. (362 / 33 / 0)
Birth of a Transformer: A Memory Viewpoint
A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Edouard Grave, Léon Bottou. Neural Information Processing Systems (NeurIPS), 2023. 01 Jun 2023. (330 / 138 / 0)
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Yuandong Tian, Yiping Wang, Beidi Chen, S. Du. Neural Information Processing Systems (NeurIPS), 2023. MLT. 25 May 2023. (364 / 96 / 0)
Fast Convergence in Learning Two-Layer Neural Networks with Separable Data
Hossein Taheri, Christos Thrampoulidis. AAAI Conference on Artificial Intelligence (AAAI), 2023. MLT. 22 May 2023. (197 / 3 / 0)
MoMo: Momentum Models for Adaptive Learning Rates
Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert Mansel Gower. International Conference on Machine Learning (ICML), 2023. 12 May 2023. (277 / 19 / 0)
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark Schmidt. International Conference on Learning Representations (ICLR), 2023. 27 Apr 2023. (248 / 100 / 0)
Benign Overfitting for Two-layer ReLU Convolutional Neural Networks
Yiwen Kou, Zi-Yuan Chen, Yuanzhou Chen, Quanquan Gu. International Conference on Machine Learning (ICML), 2023. MLT. 07 Mar 2023. (177 / 23 / 0)
Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro. Annual Conference Computational Learning Theory (COLT), 2023. 02 Mar 2023. (179 / 28 / 0)
Generalization and Stability of Interpolating Neural Networks with Minimal Width
Hossein Taheri, Christos Thrampoulidis. Journal of machine learning research (JMLR), 2023. 18 Feb 2023. (275 / 20 / 0)
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Hongkang Li, Ming Wang, Sijia Liu, Pin-Yu Chen. International Conference on Learning Representations (ICLR), 2023. ViT, MLT. 12 Feb 2023. (437 / 77 / 0)
Transformers as Algorithms: Generalization and Stability in In-context Learning
Yingcong Li, M. E. Ildiz, Dimitris Papailiopoulos, Samet Oymak. International Conference on Machine Learning (ICML), 2023. 17 Jan 2023. (243 / 217 / 0)
Transformers learn in-context by gradient descent
J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. International Conference on Machine Learning (ICML), 2022. MLT. 15 Dec 2022. (394 / 625 / 0)
What learning algorithm is in-context learning? Investigations with linear models
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou. International Conference on Learning Representations (ICLR), 2022. 28 Nov 2022. (434 / 595 / 0)
Convexifying Transformers: Improving optimization and understanding of transformer networks
Tolga Ergen, Behnam Neyshabur, Harsh Mehta. MLT. 20 Nov 2022. (191 / 15 / 0)
Vision Transformers provably learn spatial structure
Samy Jelassi, Michael E. Sander, Yuan-Fang Li. Neural Information Processing Systems (NeurIPS), 2022. ViT, MLT. 13 Oct 2022. (179 / 99 / 0)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu. International Conference on Learning Representations (ICLR), 2022. MLT. 13 Oct 2022. (191 / 48 / 0)
On the Implicit Bias in Deep-Learning Algorithms
Gal Vardi. Communications of the ACM (CACM), 2022. FedML, AI4CE. 26 Aug 2022. (288 / 107 / 0)
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022. ODL. 13 Aug 2022. (313 / 233 / 0)
Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Arda Sahiner, Tolga Ergen, Batu Mehmet Ozturkler, John M. Pauly, Morteza Mardani, Mert Pilanci. International Conference on Machine Learning (ICML), 2022. 17 May 2022. (267 / 35 / 0)
The Quarks of Attention
Pierre Baldi, Roman Vershynin. Artificial Intelligence (AIJ), 2022. GNN. 15 Feb 2022. (88 / 11 / 0)
Benign Overfitting in Two-layer Convolutional Neural Networks
Yuan Cao, Zixiang Chen, M. Belkin, Quanquan Gu. Neural Information Processing Systems (NeurIPS), 2022. MLT. 14 Feb 2022. (313 / 105 / 0)
Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data
Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett. Annual Conference Computational Learning Theory (COLT), 2022. MLT. 11 Feb 2022. (396 / 86 / 0)
Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, B. Song. ViT. 27 Dec 2021. (185 / 276 / 0)