Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
19 October 2020
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah A. Smith

Papers citing "Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent"

19 / 19 papers shown

Implicitly Normalized Online PCA: A Regularized Algorithm with Exact High-Dimensional Dynamics
Samet Demir, Zafer Dogan
01 Dec 2025

The Transformer Cookbook
Andy Yang, Christopher Watson, Anton Xue, S. Bhattamishra, Jose Llarena, William Merrill, Emile Dos Santos Ferreira, Anej Svete, David Chiang
01 Oct 2025

Temporal Generalization: A Reality Check
Divyam Madaan, S. Chopra, Kyunghyun Cho
27 Sep 2025

Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility
Melih Barsbey, Lucas Prieto, Stefanos Zafeiriou, Tolga Birdal
23 Jul 2025

The Counting Power of Transformers
Marco Sälzer, Chris Köcher, Anthony Widjaja Lin, Georg Zetzsche
16 May 2025

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Ruiquan Huang, Yingbin Liang, Jing Yang
02 May 2025

Limits of Deep Learning: Sequence Modeling through the Lens of Complexity Theory
Nikola Zubić, Federico Soldá, Aurelio Sulser, Davide Scaramuzza
26 May 2024

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers
Andy Yang, David Chiang
05 Apr 2024

Language models scale reliably with over-training and on downstream tasks
International Conference on Learning Representations (ICLR), 2024
S. Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, ..., Y. Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
13 Mar 2024

Disentangling the Causes of Plasticity Loss in Neural Networks
Clare Lyle, Zeyu Zheng, Khimya Khetarpal, H. V. Hasselt, Razvan Pascanu, James Martens, Will Dabney
29 Feb 2024

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
08 Feb 2024

Small-scale proxies for large-scale Transformer training instabilities
International Conference on Learning Representations (ICLR), 2023
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, A. Alemi, ..., Jascha Narain Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
25 Sep 2023

Language Models Understand Us, Poorly
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Jared Moore
19 Oct 2022

A Logic for Expressing Log-Precision Transformers
Neural Information Processing Systems (NeurIPS), 2022
William Merrill, Ashish Sabharwal
06 Oct 2022

The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Transactions of the Association for Computational Linguistics (TACL), 2022
William Merrill, Ashish Sabharwal
02 Jul 2022

Overcoming a Theoretical Limitation of Self-Attention
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
David Chiang, Peter A. Cholak
24 Feb 2022

Extracting Finite Automata from RNNs Using State Merging
William Merrill, Nikolaos Tsilivis
28 Jan 2022

How BPE Affects Memorization in Transformers
Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes
06 Oct 2021

Saturated Transformers are Constant-Depth Threshold Circuits
Transactions of the Association for Computational Linguistics (TACL), 2021
William Merrill, Ashish Sabharwal, Noah A. Smith
30 Jun 2021