JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
International Conference on Learning Representations (ICLR), 2024
1 October 2023
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du
Links: arXiv 2310.00535 (abs, PDF, HTML) · HuggingFace (2 upvotes) · GitHub (293★)

Papers citing "JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention" (33 papers shown)

From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Zheng-an Chen, Tao Luo · AI4CE
68 · 1 · 0 · 08 Oct 2025

Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory
Pengxiao Lin, Zheng Chen, Zhi-Qin John Xu · LRM
56 · 1 · 0 · 29 Sep 2025

Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian · LRM
113 · 2 · 0 · 27 Sep 2025

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei · LRM
437 · 4 · 0 · 12 Jun 2025

Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification
Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel
163 · 0 · 0 · 03 Jun 2025

Taming Transformer Without Using Learning Rate Warmup
International Conference on Learning Representations (ICLR), 2025
Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao
156 · 3 · 0 · 28 May 2025

How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang · MLT
286 · 1 · 0 · 21 May 2025

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang
484 · 4 · 0 · 02 May 2025

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak
147 · 4 · 0 · 06 Apr 2025

On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Neural Information Processing Systems (NeurIPS), 2024
Alexander C. Li, Yuandong Tian, Bin Chen, Deepak Pathak, Xinlei Chen
161 · 8 · 0 · 14 Nov 2024

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
246 · 24 · 0 · 17 Oct 2024

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
International Conference on Learning Representations (ICLR), 2024
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang
365 · 3 · 0 · 17 Oct 2024

A Theoretical Survey on Foundation Models
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao
243 · 0 · 0 · 15 Oct 2024

Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
320 · 27 · 0 · 15 Oct 2024

Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
International Conference on Learning Representations (ICLR), 2024
Hongkang Li, Songtao Lu, Pin-Yu Chen, Xiaodong Cui, Meng Wang · LRM
323 · 11 · 0 · 03 Oct 2024

Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Neural Information Processing Systems (NeurIPS), 2024
Ruiquan Huang, Yingbin Liang, Jing Yang
210 · 10 · 0 · 25 Sep 2024

On the Power of Convolution Augmented Transformer
Mingchen Li, Xuechen Zhang, Yixiao Huang, Samet Oymak
172 · 5 · 0 · 08 Jul 2024

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam · CLL, KELM
242 · 12 · 0 · 26 Jun 2024

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis
Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen
211 · 8 · 0 · 24 Jun 2024

Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective
Xinhao Yao, Xiaolin Hu, Shenzhi Yang, Yong Liu
180 · 3 · 0 · 06 Jun 2024

Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
187 · 43 · 0 · 30 May 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
205 · 8 · 0 · 29 May 2024

Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics
Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael I. Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell · LRM, AI4CE
221 · 24 · 0 · 07 May 2024

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zinan Lin, A. Anandkumar, Yuandong Tian
325 · 322 · 0 · 06 Mar 2024

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zinan Lin
203 · 58 · 0 · 05 Mar 2024

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chao Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhen-fei Yin, Yu Qiao, Yong Liu, Jing Shao · LLMSV, LRM
160 · 15 · 0 · 29 Feb 2024

On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, ..., Victor Storchan, Daniel Zhang, James Grimmelmann, Abigail Z. Jacobs, Arvind Narayanan
199 · 79 · 0 · 27 Feb 2024

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen · MLT
371 · 31 · 0 · 23 Feb 2024

Linear Transformers are Versatile In-Context Learners
Max Vladymyrov, J. Oswald, Mark Sandler, Rong Ge
154 · 27 · 0 · 21 Feb 2024

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
281 · 26 · 0 · 08 Feb 2024

Self-attention Networks Localize When QK-eigenspectrum Concentrates
Han Bao, Ryuichiro Hataya, Ryo Karakida
115 · 10 · 0 · 03 Feb 2024

An Information-Theoretic Analysis of In-Context Learning
International Conference on Machine Learning (ICML), 2024
Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy
301 · 33 · 0 · 28 Jan 2024

The Expressibility of Polynomial based Attention Scheme
Zhao Song, Guangyi Xu, Junze Yin
266 · 7 · 0 · 30 Oct 2023