Language models scale reliably with over-training and on downstream tasks. International Conference on Learning Representations (ICLR), 2024.
Small-scale proxies for large-scale Transformer training instabilities. International Conference on Learning Representations (ICLR), 2023.
Language Models Understand Us, Poorly. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
A Logic for Expressing Log-Precision Transformers. Neural Information Processing Systems (NeurIPS), 2022.
The Parallelism Tradeoff: Limitations of Log-Precision Transformers. Transactions of the Association for Computational Linguistics (TACL), 2022.
Overcoming a Theoretical Limitation of Self-Attention. Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
Saturated Transformers are Constant-Depth Threshold Circuits. Transactions of the Association for Computational Linguistics (TACL), 2021.