On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Neural Information Processing Systems (NeurIPS), 2020
23 November 2020
Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, Masashi Sugiyama
arXiv:2011.11152 (abs / PDF / HTML)

Papers citing "On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective"

19 citing papers

Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu (17 Oct 2025)

Cautious Weight Decay
Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu (14 Oct 2025)

Self Identity Mapping
Neural Networks (NN), 2025
Xiuding Cai, Yaoyao Zhu, Linjie Fu, Dong Miao, Yu Yao (17 Sep 2025)

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, L. Yin (17 Jun 2025)

Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness
Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, Volkan Cevher (02 Jun 2025)

Why Gradients Rapidly Increase Near the End of Training
Aaron Defazio (02 Jun 2025)

NeuralGrok: Accelerate Grokking by Neural Gradient Transformation
Xinyu Zhou, Simin Fan, Martin Jaggi, Jie Fu (24 Apr 2025)

Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?
Tom Jacobs, Chao Zhou, R. Burkholz (17 Apr 2025)

Do we really have to filter out random noise in pre-training data for language models?
Jinghan Ru, Yuxin Xie, Xianwei Zhuang, Yuguo Yin, Zhihui Guo, Zhiming Liu, Qianli Ren, Yuexian Zou (10 Feb 2025)

Weight decay induces low-rank attention layers
Neural Information Processing Systems (NeurIPS), 2024
Seijin Kobayashi, Yassir Akram, J. Oswald (31 Oct 2024)

DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
International Conference on Learning Representations (ICLR), 2024
Wenlong Deng, Yize Zhao, V. Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis (12 Oct 2024)

mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, ..., Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, Min Zhang (29 Jul 2024)

Neural Field Classifiers via Target Encoding and Classification Loss
Xindi Yang, Zeke Xie, Xiong Zhou, Boyu Liu, Buhua Liu, Yi Liu, Haoran Wang, Yunfeng Cai, Mingming Sun (02 Mar 2024)

Neural Networks with (Low-Precision) Polynomial Approximations: New Insights and Techniques for Accuracy Improvement
Chi Zhang, Jingjing Fan, Man Ho Au, Siu-Ming Yiu (17 Feb 2024)

Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
International Conference on Machine Learning (ICML), 2023
Atli Kosson, Bettina Messmer, Martin Jaggi (26 May 2023)

On the Overlooked Structure of Stochastic Gradients
Neural Information Processing Systems (NeurIPS), 2022
Zeke Xie, Qian-Yuan Tang, Mingming Sun, P. Li (05 Dec 2022)

On effects of Knowledge Distillation on Transfer Learning
Sushil Thapa (18 Oct 2022)

Residual-Concatenate Neural Network with Deep Regularization Layers for Binary Classification
International Conference on Intelligent Computing and Control Systems (ICICCS), 2022
Abhishek Gupta, Sruthi Nair, Raunak Joshi, V. Chitre (25 May 2022)

Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein (29 Sep 2021)