JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
International Conference on Learning Representations (ICLR), 2023
arXiv: 2310.00535 · 1 October 2023
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du
Links: arXiv (abs) · PDF · HTML · HuggingFace · GitHub (293★)

Papers citing "JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention"

34 papers shown.

Towards Understanding Transformers in Learning Random Walks
Wei Shi, Yuan Cao
28 Nov 2025

From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Zheng-an Chen, Tao Luo
Communities: AI4CE
08 Oct 2025

Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory
Pengxiao Lin, Zheng Chen, Zhi-Qin John Xu
Communities: LRM
29 Sep 2025

Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Communities: LRM
27 Sep 2025

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Communities: LRM
12 Jun 2025

Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification
Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel
03 Jun 2025

Taming Transformer Without Using Learning Rate Warmup
International Conference on Learning Representations (ICLR), 2025
Xianbiao Qi, Yelin He, Jiaquan Ye, Chun-Guang Li, Bojia Zi, Xili Dai, Qin Zou, Rong Xiao
28 May 2025

How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization
Quan Nguyen, Thanh Nguyen-Tang
Communities: MLT
21 May 2025

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang
02 May 2025

Gating is Weighting: Understanding Gated Linear Attention through In-context Learning
Yingcong Li, Davoud Ataee Tarzanagh, A. S. Rawat, Maryam Fazel, Samet Oymak
06 Apr 2025

On the Surprising Effectiveness of Attention Transfer for Vision Transformers
Neural Information Processing Systems (NeurIPS), 2024
Alexander C. Li, Yuandong Tian, Bin Chen, Deepak Pathak, Xinlei Chen
14 Nov 2024

Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
17 Oct 2024

On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery
International Conference on Learning Representations (ICLR), 2024
Renpu Liu, Ruida Zhou, Cong Shen, Jing Yang
17 Oct 2024

A Theoretical Survey on Foundation Models
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao
15 Oct 2024

Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Bo Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
15 Oct 2024

Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
International Conference on Learning Representations (ICLR), 2024
Hongkang Li, Songtao Lu, Pin-Yu Chen, Xiaodong Cui, Meng Wang
Communities: LRM
03 Oct 2024

Non-asymptotic Convergence of Training Transformers for Next-token Prediction
Neural Information Processing Systems (NeurIPS), 2024
Ruiquan Huang, Yingbin Liang, Jing Yang
25 Sep 2024

On the Power of Convolution Augmented Transformer
Mingchen Li, Xuechen Zhang, Yixiao Huang, Samet Oymak
08 Jul 2024

Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam
Communities: CLL, KELM
26 Jun 2024

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis
Hongkang Li, Meng Wang, Shuai Zhang, Sijia Liu, Pin-Yu Chen
24 Jun 2024

Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective
Xinhao Yao, Xiaolin Hu, Shenzhi Yang, Yong Liu
06 Jun 2024

Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
30 May 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
29 May 2024

Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics
Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael I. Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell
Communities: LRM, AI4CE
07 May 2024

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zinan Lin, A. Anandkumar, Yuandong Tian
06 Mar 2024

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zinan Lin
05 Mar 2024

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chao Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhen-fei Yin, Yu Qiao, Yong Liu, Jing Shao
Communities: LLMSV, LRM
29 Feb 2024

On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, ..., Victor Storchan, Daniel Zhang, Mark A. Lemley, Abigail Z. Jacobs, Arvind Narayanan
27 Feb 2024

How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?
Hongkang Li, Meng Wang, Songtao Lu, Xiaodong Cui, Pin-Yu Chen
Communities: MLT
23 Feb 2024

Linear Transformers are Versatile In-Context Learners
Max Vladymyrov, J. Oswald, Mark Sandler, Rong Ge
21 Feb 2024

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
08 Feb 2024

Self-attention Networks Localize When QK-eigenspectrum Concentrates
Han Bao, Ryuichiro Hataya, Ryo Karakida
03 Feb 2024

An Information-Theoretic Analysis of In-Context Learning
International Conference on Machine Learning (ICML), 2024
Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy
28 Jan 2024

The Expressibility of Polynomial based Attention Scheme
Zhao Song, Guangyi Xu, Junze Yin
30 Oct 2023