Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

7 June 2022

Antonio Orvieto

Papers citing "Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse"

23 / 23 papers shown

Title
Don't be lazy: CompleteP enables compute-efficient deep transformers Nolan Dey Bin Claire Zhang Lorenzo Noci Mufan Bill Li Blake Bordelon Shane Bergsma C. Pehlevan Boris Hanin Joel Hestness 39 0 0 02 May 2025
Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity Ruifeng Ren Yong Liu 114 0 0 26 Apr 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization Zhijian Zhuo Yutao Zeng Ya Wang Sijun Zhang Jian Yang Xiaoqing Li Xun Zhou Jinwen Ma 46 0 0 06 Mar 2025
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute Sotiris Anagnostidis Gregor Bachmann Yeongmin Kim Jonas Kohler Markos Georgopoulos A. Sanakoyeu Yuming Du Albert Pumarola Ali K. Thabet Edgar Schönfeld 87 0 0 27 Feb 2025
The Geometry of Tokens in Internal Representations of Large Language Models Karthik Viswanathan Yuri Gardinazzi Giada Panerai Alberto Cazzaniga Matteo Biagetti AIFin 88 4 0 17 Jan 2025
Activating Self-Attention for Multi-Scene Absolute Pose Regression Miso Lee Jihwan Kim Jae-Pil Heo ViT 29 0 0 03 Nov 2024
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis Weronika Ormaniec Felix Dangel Sidak Pal Singh 33 6 0 14 Oct 2024
Lambda-Skip Connections: the architectural component that prevents Rank Collapse Federico Arangath Joseph Jerome Sieber M. Zeilinger Carmen Amo Alonso 33 0 0 14 Oct 2024
Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization Xinhao Yao Hongjin Qian Xiaolin Hu Gengze Xu Wei Liu Jian Luan B. Wang Y. Liu 48 0 0 03 Oct 2024
Sampling Foundational Transformer: A Theoretical Perspective Viet Anh Nguyen Minh Lenhat Khoa Nguyen Duong Duc Hieu Dao Huu Hung Truong Son-Hy 42 0 0 11 Aug 2024
An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes Antonio Orvieto Lin Xiao 37 2 0 05 Jul 2024
Reasoning in Large Language Models: A Geometric Perspective Romain Cosentino Sarath Shekkizhar LRM 44 2 0 02 Jul 2024
Understanding and Minimising Outlier Features in Neural Network Training Bobby He Lorenzo Noci Daniele Paliotta Imanol Schlag Thomas Hofmann 34 3 0 29 May 2024
Infinite Limits of Multi-head Transformer Dynamics Blake Bordelon Hamza Tahir Chaudhry C. Pehlevan AI4CE 42 9 0 24 May 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models Frederik Kunstner Robin Yadav Alan Milligan Mark Schmidt Alberto Bietti 31 26 0 29 Feb 2024
Setting the Record Straight on Transformer Oversmoothing G. Dovonon M. Bronstein Matt J. Kusner 20 5 0 09 Jan 2024
Graph Convolutions Enrich the Self-Attention in Transformers! Jeongwhan Choi Hyowon Wi Jayoung Kim Yehjin Shin Kookjin Lee Nathaniel Trask Noseong Park 25 4 0 07 Dec 2023
Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks Andreas Roth Thomas Liebig 37 11 0 31 Aug 2023
How to Protect Copyright Data in Optimization of Large Language Models? T. Chu Zhao-quan Song Chiwun Yang 32 29 0 23 Aug 2023
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers Sotiris Anagnostidis Dario Pavllo Luca Biggio Lorenzo Noci Aurélien Lucchi Thomas Hofmann 34 53 0 25 May 2023
Formal Mathematics Statement Curriculum Learning Stanislas Polu Jesse Michael Han Kunhao Zheng Mantas Baksys Igor Babuschkin Ilya Sutskever AIMat 79 115 0 03 Feb 2022
Stable ResNet Soufiane Hayou Eugenio Clerico Bo He George Deligiannidis Arnaud Doucet Judith Rousseau ODL SSeg 46 51 0 24 Oct 2020
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks Lechao Xiao Yasaman Bahri Jascha Narain Sohl-Dickstein S. Schoenholz Jeffrey Pennington 220 348 0 14 Jun 2018