Linear attention is (maybe) all you need (to understand transformer optimization)

2 October 2023
Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, S. Sra

Papers citing "Linear attention is (maybe) all you need (to understand transformer optimization)"

42 / 42 papers shown
On the Convergence of Adam-Type Algorithm for Bilevel Optimization under Unbounded Smoothness
Xiaochuan Gong, Jie Hao, Mingrui Liu
05 Mar 2025 · 54 · 0 · 0
When Can You Get Away with Low Memory Adam?
Dayal Singh Kalra, John Kirchenbauer, M. Barkeshli, Tom Goldstein
03 Mar 2025 · 69 · 0 · 0
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
Yoonsoo Nam, Seok Hyeong Lee, Clementine Domine, Yea Chan Park, Charles London, Wonyl Choi, Niclas Goring, Seungjai Lee · AI4CE
28 Feb 2025 · 33 · 0 · 0
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari, Issei Sato · ODL
31 Jan 2025 · 59 · 0 · 0
Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang, Aaditya K. Singh, Peter E. Latham, Andrew Saxe · MLT
28 Jan 2025 · 62 · 1 · 0
Pretrained transformer efficiently learns low-dimensional target functions in-context
Kazusato Oko, Yujin Song, Taiji Suzuki, Denny Wu
04 Nov 2024 · 39 · 4 · 0
On the Role of Depth and Looping for In-Context Learning with Task Diversity
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
29 Oct 2024 · 22 · 2 · 0
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
17 Oct 2024 · 29 · 9 · 0
From Gradient Clipping to Normalization for Heavy Tailed SGD
Florian Hübler, Ilyas Fatkhullin, Niao He
17 Oct 2024 · 40 · 5 · 0
Learning Linear Attention in Polynomial Time
Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
14 Oct 2024 · 17 · 2 · 0
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec, Felix Dangel, Sidak Pal Singh
14 Oct 2024 · 33 · 6 · 0
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar
10 Oct 2024 · 67 · 17 · 0
Spin glass model of in-context learning
Yuhao Li, Ruoran Bai, Haiping Huang · LRM
05 Aug 2024 · 42 · 0 · 0
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade
10 Jul 2024 · 42 · 17 · 0
Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
24 Jun 2024 · 34 · 33 · 0
Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks
Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel
04 Jun 2024 · 36 · 4 · 0
Why Larger Language Models Do In-context Learning Differently?
Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
30 May 2024 · 37 · 18 · 0
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang · LRM
28 May 2024 · 28 · 13 · 0
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
Chenyu Zheng, Wei Huang, Rongzheng Wang, Guoqiang Wu, Jun Zhu, Chongxuan Li
27 May 2024 · 34 · 1 · 0
Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
25 May 2024 · 56 · 4 · 1
Transfer Learning Beyond Bounded Density Ratios
Alkis Kalavasis, Ilias Zadik, Manolis Zampetakis
18 Mar 2024 · 33 · 4 · 0
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
29 Feb 2024 · 29 · 26 · 0
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhimin Luo
26 Feb 2024 · 24 · 40 · 0
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Tim Tsz-Kit Lau, Han Liu, Mladen Kolar · ODL
17 Feb 2024 · 24 · 6 · 0
SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention
Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko · AI4TS
15 Feb 2024 · 39 · 21 · 0
Towards Understanding Inductive Bias in Transformers: A View From Infinity
Itay Lavie, Guy Gur-Ari, Z. Ringel
07 Feb 2024 · 32 · 1 · 0
Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai
02 Feb 2024 · 37 · 11 · 0
Transformers Learn Nonlinear Features In Context: Nonconvex Mean-field Dynamics on the Attention Landscape
Juno Kim, Taiji Suzuki
02 Feb 2024 · 18 · 18 · 0
Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
Yue Xing, Xiaofeng Lin, Chenheng Xu, Namjoon Suh, Qifan Song, Guang Cheng
01 Feb 2024 · 11 · 3 · 0
Superiority of Multi-Head Attention in In-Context Linear Regression
Yingqian Cui, Jie Ren, Pengfei He, Jiliang Tang, Yue Xing
30 Jan 2024 · 34 · 12 · 0
Setting the Record Straight on Transformer Oversmoothing
G. Dovonon, M. Bronstein, Matt J. Kusner
09 Jan 2024 · 20 · 5 · 0
Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
Xiang Cheng, Yuxin Chen, S. Sra
11 Dec 2023 · 18 · 35 · 0
Uncovering mesa-optimization algorithms in Transformers
J. Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, ..., Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento
11 Sep 2023 · 17 · 53 · 0
Exposing Attention Glitches with Flip-Flop Language Modeling
Bingbin Liu, Jordan T. Ash, Surbhi Goel, A. Krishnamurthy, Cyril Zhang · LRM
01 Jun 2023 · 27 · 46 · 0
Transformers learn to implement preconditioned gradient descent for in-context learning
Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, S. Sra · ODL
01 Jun 2023 · 17 · 147 · 0
The Crucial Role of Normalization in Sharpness-Aware Minimization
Yan Dai, Kwangjun Ahn, S. Sra
24 May 2023 · 21 · 17 · 0
Convergence of Adam Under Relaxed Assumptions
Haochuan Li, Alexander Rakhlin, Ali Jadbabaie
27 Apr 2023 · 29 · 53 · 0
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
27 Apr 2023 · 40 · 67 · 0
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang · ELM, AI4MH, AI4CE, ALM
22 Mar 2023 · 245 · 2,232 · 0
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li, Yuan-Fang Li, Andrej Risteski
07 Mar 2023 · 107 · 61 · 0
Learning threshold neurons via the "edge of stability"
Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Y. Lee, Felipe Suarez, Yi Zhang · MLT
14 Dec 2022 · 31 · 36 · 0
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020 · 226 · 4,453 · 0