Implicit Bias and Fast Convergence Rates for Self-attention
v1 · v2 (latest)

8 February 2024
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
ArXiv (abs) · PDF · HTML

Papers citing "Implicit Bias and Fast Convergence Rates for Self-attention"

50 / 85 papers shown
Transformers are almost optimal metalearners for linear classification
Roey Magen, Gal Vardi. 22 Oct 2025.

Learning Linear Regression with Low-Rank Tasks in-Context
Kaito Takanami, Takashi Takahashi, Y. Kabashima. 06 Oct 2025.

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei. 12 Jun 2025.

Transformative or Conservative? Conservation laws for ResNets and Transformers
Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré. 06 Jun 2025.

The Rich and the Simple: On the Implicit Bias of Adam and SGD
Bhavya Vasudeva, Jung Whan Lee, Willie Neiswanger, Mahdi Soltanolkotabi. 29 May 2025.

Variational Deep Learning via Implicit Regularization
Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham. 26 May 2025.

How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang. 21 May 2025.

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang. 02 May 2025.

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu. 26 Apr 2025.

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak. 06 Apr 2025.

When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective
Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat A. Erdogdu. 14 Mar 2025.

Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang, Aaditya K. Singh, Peter E. Latham, Andrew Saxe. 27 Jan 2025.

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery. International Conference on Learning Representations (ICLR), 2024.
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang. 17 Oct 2024.

Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context. International Conference on Learning Representations (ICLR), 2024.
Spencer Frei, Gal Vardi. 02 Oct 2024.

Non-asymptotic Convergence of Training Transformers for Next-token Prediction. Neural Information Processing Systems (NeurIPS), 2024.
Ruiquan Huang, Yingbin Liang, Jing Yang. 25 Sep 2024.

Implicit Regularization of Gradient Flow on One-Layer Softmax Attention
Heejune Sheen, Siyu Chen, Tianhao Wang, Harrison H. Zhou. 13 Mar 2024.

Mechanics of Next Token Prediction with Self-Attention. International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
Yingcong Li, Yixiao Huang, M. E. Ildiz, A. S. Rawat, Samet Oymak. 12 Mar 2024.

Transformers Learn Low Sensitivity Functions: Investigations and Implications. International Conference on Learning Representations (ICLR), 2024.
Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Willie Neiswanger. 11 Mar 2024.

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
M. E. Ildiz, Yixiao Huang, Yingcong Li, A. S. Rawat, Samet Oymak. 21 Feb 2024.

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael C. Gastpar. 06 Feb 2024.

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data. Neural Information Processing Systems (NeurIPS), 2023.
Yiwen Kou, Zixiang Chen, Quanquan Gu. 29 Oct 2023.

On the Optimization and Generalization of Multi-head Attention
Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. 19 Oct 2023.

JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention. International Conference on Learning Representations (ICLR), 2023.
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du. 01 Oct 2023.

Max-Margin Token Selection in Attention Mechanism. Neural Information Processing Systems (NeurIPS), 2023.
Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak. 23 Jun 2023.

Trained Transformers Learn Linear Models In-Context. Journal of machine learning research (JMLR), 2023.
Ruiqi Zhang, Spencer Frei, Peter L. Bartlett. 16 Jun 2023.

On the Role of Attention in Prompt-tuning. International Conference on Machine Learning (ICML), 2023.
Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis. 06 Jun 2023.

Representational Strengths and Limitations of Transformers. Neural Information Processing Systems (NeurIPS), 2023.
Clayton Sanford, Daniel J. Hsu, Matus Telgarsky. 05 Jun 2023.

Memorization Capacity of Multi-Head Attention in Transformers. International Conference on Learning Representations (ICLR), 2023.
Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis. 03 Jun 2023.

Birth of a Transformer: A Memory Viewpoint. Neural Information Processing Systems (NeurIPS), 2023.
A. Bietti, Vivien A. Cabannes, Diane Bouchacourt, Edouard Grave, Léon Bottou. 01 Jun 2023.

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer. Neural Information Processing Systems (NeurIPS), 2023.
Yuandong Tian, Yiping Wang, Beidi Chen, S. Du. 25 May 2023.

Fast Convergence in Learning Two-Layer Neural Networks with Separable Data. AAAI Conference on Artificial Intelligence (AAAI), 2023.
Hossein Taheri, Christos Thrampoulidis. 22 May 2023.

MoMo: Momentum Models for Adaptive Learning Rates. International Conference on Machine Learning (ICML), 2023.
Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert Mansel Gower. 12 May 2023.

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be. International Conference on Learning Representations (ICLR), 2023.
Frederik Kunstner, Jacques Chen, J. Lavington, Mark Schmidt. 27 Apr 2023.

Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. International Conference on Machine Learning (ICML), 2023.
Yiwen Kou, Zi-Yuan Chen, Yuanzhou Chen, Quanquan Gu. 07 Mar 2023.

Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization. Annual Conference Computational Learning Theory (COLT), 2023.
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro. 02 Mar 2023.

Generalization and Stability of Interpolating Neural Networks with Minimal Width. Journal of machine learning research (JMLR), 2023.
Hossein Taheri, Christos Thrampoulidis. 18 Feb 2023.

A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity. International Conference on Learning Representations (ICLR), 2023.
Hongkang Li, Ming Wang, Sijia Liu, Pin-Yu Chen. 12 Feb 2023.

Transformers as Algorithms: Generalization and Stability in In-context Learning. International Conference on Machine Learning (ICML), 2023.
Yingcong Li, M. E. Ildiz, Dimitris Papailiopoulos, Samet Oymak. 17 Jan 2023.

Transformers learn in-context by gradient descent. International Conference on Machine Learning (ICML), 2022.
J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. 15 Dec 2022.

What learning algorithm is in-context learning? Investigations with linear models. International Conference on Learning Representations (ICLR), 2022.
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou. 28 Nov 2022.

Convexifying Transformers: Improving optimization and understanding of transformer networks
Tolga Ergen, Behnam Neyshabur, Harsh Mehta. 20 Nov 2022.

Vision Transformers provably learn spatial structure. Neural Information Processing Systems (NeurIPS), 2022.
Samy Jelassi, Michael E. Sander, Yuan-Fang Li. 13 Oct 2022.

Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data. International Conference on Learning Representations (ICLR), 2022.
Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu. 13 Oct 2022.

On the Implicit Bias in Deep-Learning Algorithms. Communications of the ACM (CACM), 2022.
Gal Vardi. 26 Aug 2022.

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan. 13 Aug 2022.

Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. International Conference on Machine Learning (ICML), 2022.
Arda Sahiner, Tolga Ergen, Batu Mehmet Ozturkler, John M. Pauly, Morteza Mardani, Mert Pilanci. 17 May 2022.

The Quarks of Attention. Artificial Intelligence (AIJ), 2022.
Pierre Baldi, Roman Vershynin. 15 Feb 2022.

Benign Overfitting in Two-layer Convolutional Neural Networks. Neural Information Processing Systems (NeurIPS), 2022.
Yuan Cao, Zixiang Chen, M. Belkin, Quanquan Gu. 14 Feb 2022.

Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data. Annual Conference Computational Learning Theory (COLT), 2022.
Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett. 11 Feb 2022.

Vision Transformer for Small-Size Datasets
Seung Hoon Lee, Seunghyun Lee, B. Song. 27 Dec 2021.