ResearchTrend.AI

Three Mechanisms of Weight Decay Regularization (arXiv:1810.12281)
29 October 2018
Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger C. Grosse

Papers citing "Three Mechanisms of Weight Decay Regularization"

50 of 58 citing papers shown.

  • Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
    Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness. 19 May 2025.
  • Low-Loss Space in Neural Networks is Continuous and Fully Connected
    Yongding Tian, Zaid Al-Ars, Maksim Kitsak, P. Hofstee. 05 May 2025. [3DPC]
  • Adaptive Extrapolated Proximal Gradient Methods with Variance Reduction for Composite Nonconvex Finite-Sum Minimization
    Ganzhao Yuan. 28 Feb 2025.
  • Towards Accurate Binary Spiking Neural Networks: Learning with Adaptive Gradient Modulation Mechanism
    Yu Liang, Wenjie Wei, A. Belatreche, Honglin Cao, Zijian Zhou, Shuai Wang, Malu Zhang, Yue Yang. 21 Feb 2025. [MQ]
  • How Much Can We Forget about Data Contamination?
    Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, U. V. Luxburg. 04 Oct 2024.
  • Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets
    Khen Cohen, Noam Levi, Yaron Oz. 28 May 2024. [BDL]
  • How to set AdamW's weight decay as you scale model and dataset size
    Xi Wang, Laurence Aitchison. 22 May 2024.
  • Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
    Shuo Xie, Zhiyuan Li. 05 Apr 2024. [OffRL]
  • Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets
    Lorenzo Brigato, Stavroula Mougiakakou. 08 Mar 2024.
  • Analyzing and Improving the Training Dynamics of Diffusion Models
    Tero Karras, M. Aittala, J. Lehtinen, Janne Hellsten, Timo Aila, S. Laine. 05 Dec 2023.
  • Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning
    Achraf Bahamou, D. Goldfarb. 23 May 2023. [ODL]
  • MoMo: Momentum Models for Adaptive Learning Rates
    Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert Mansel Gower. 12 May 2023.
  • On the Ideal Number of Groups for Isometric Gradient Propagation
    Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim. 07 Feb 2023.
  • A Stochastic Proximal Polyak Step Size
    Fabian Schaipp, Robert Mansel Gower, M. Ulbrich. 12 Jan 2023.
  • Feature Weaken: Vicinal Data Augmentation for Classification
    Songhao Jiang, Yan Chu, Tian-Hui Ma, Tianning Zang. 20 Nov 2022.
  • Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis
    Taiki Miyagawa. 28 Oct 2022.
  • Noise Injection Node Regularization for Robust Learning
    N. Levi, I. Bloch, M. Freytsis, T. Volansky. 27 Oct 2022. [AI4CE]
  • SGD with Large Step Sizes Learns Sparse Features
    Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion. 11 Oct 2022.
  • Scale-invariant Bayesian Neural Networks with Connectivity Tangent Kernel
    Sungyub Kim, Si-hun Park, Kyungsu Kim, Eunho Yang. 30 Sep 2022. [BDL]
  • Distributed Semi-supervised Fuzzy Regression with Interpolation Consistency Regularization
    Ye-ling Shi, Leijie Zhang, Zehong Cao, M. Tanveer, Chin-Teng Lin. 18 Sep 2022.
  • Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
    Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora. 14 Jun 2022. [FAtt]
  • Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks
    Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim. 15 May 2022.
  • GPT-NeoX-20B: An Open-Source Autoregressive Language Model
    Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, ..., Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach. 14 Apr 2022.
  • Semi-Discrete Normalizing Flows through Differentiable Tessellation
    Ricky T. Q. Chen, Brandon Amos, Maximilian Nickel. 14 Mar 2022.
  • A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments
    Randall Balestriero, Ishan Misra, Yann LeCun. 16 Feb 2022.
  • Cyclical Focal Loss
    L. Smith. 16 Feb 2022.
  • A Geometric Understanding of Natural Gradient
    Qinxun Bai, S. Rosenberg, Wei Xu. 13 Feb 2022.
  • Deep Learning to advance the Eigenspace Perturbation Method for Turbulence Model Uncertainty Quantification
    Khashayar Nobarani, S. Razavi. 11 Feb 2022.
  • Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
    Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, Krzysztof J. Geras. 10 Feb 2022.
  • Robust Training of Neural Networks Using Scale Invariant Architectures
    Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Surinder Kumar. 02 Feb 2022.
  • Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization
    Frederik Benzing. 28 Jan 2022. [ODL]
  • Target-Oriented Fine-tuning for Zero-Resource Named Entity Recognition
    Ying Zhang, Fandong Meng, Jinan Xu, Jie Zhou. 22 Jul 2021.
  • Initialization and Regularization of Factorized Neural Layers
    M. Khodak, Neil A. Tenenholtz, Lester W. Mackey, Nicolò Fusi. 03 May 2021.
  • Fundamental Challenges in Deep Learning for Stiff Contact Dynamics
    Mihir Parmar, Mathew Halm, Michael Posa. 29 Mar 2021.
  • FixNorm: Dissecting Weight Decay for Training Deep Neural Networks
    Yucong Zhou, Yunxiao Sun, Zhaobai Zhong. 29 Mar 2021.
  • Parareal Neural Networks Emulating a Parallel-in-time Algorithm
    Zhanyu Ma, Jiyang Xie, Jingyi Yu. 16 Mar 2021. [AI4CE]
  • Intraclass clustering: an implicit learning ability that regularizes DNNs
    Simon Carbonnelle, Christophe De Vleeschouwer. 11 Mar 2021.
  • Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics
    D. Kunin, Javier Sagastuy-Breña, Surya Ganguli, Daniel L. K. Yamins, Hidenori Tanaka. 08 Dec 2020.
  • A Trace-restricted Kronecker-Factored Approximation to Natural Gradient
    Kai-Xin Gao, Xiaolei Liu, Zheng-Hai Huang, Min Wang, Zidong Wang, Dachuan Xu, F. Yu. 21 Nov 2020.
  • A Random Matrix Theory Approach to Damping in Deep Learning
    Diego Granziol, Nicholas P. Baskerville. 15 Nov 2020. [AI4CE, ODL]
  • AEGD: Adaptive Gradient Descent with Energy
    Hailiang Liu, Xuping Tian. 10 Oct 2020. [ODL]
  • Group Whitening: Balancing Learning Efficiency and Representational Capacity
    Lei Huang, Yi Zhou, Li Liu, Fan Zhu, Ling Shao. 28 Sep 2020.
  • Whitening and second order optimization both make information in the dataset unusable during training, and can reduce or prevent generalization
    Neha S. Wadia, Daniel Duckworth, S. Schoenholz, Ethan Dyer, Jascha Narain Sohl-Dickstein. 17 Aug 2020.
  • Can we Estimate Truck Accident Risk from Telemetric Data using Machine Learning?
    Antonio Hebert, Ian Marineau, Gilles Gervais, Tristan Glatard, Brigitte Jaumard. 17 Jul 2020.
  • A General Family of Stochastic Proximal Gradient Methods for Deep Learning
    Jihun Yun, A. Lozano, Eunho Yang. 15 Jul 2020.
  • When Does Preconditioning Help or Hurt Generalization?
    S. Amari, Jimmy Ba, Roger C. Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu. 18 Jun 2020.
  • Understanding and Mitigating Exploding Inverses in Invertible Neural Networks
    Jens Behrmann, Paul Vicol, Kuan-Chieh Wang, Roger C. Grosse, J. Jacobsen. 16 Jun 2020.
  • New Interpretations of Normalization Methods in Deep Learning
    Jiacheng Sun, Xiangyong Cao, Hanwen Liang, Weiran Huang, Zewei Chen, Zhenguo Li. 16 Jun 2020.
  • On the training dynamics of deep networks with $L_2$ regularization
    Aitor Lewkowycz, Guy Gur-Ari. 15 Jun 2020.
  • On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression
    Denny Wu, Ji Xu. 10 Jun 2020.