AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

17 June 2025
Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Main: 11 pages · 11 figures · 11 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training experiments with model sizes from 60M to 1B parameters demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at this https URL.
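
The abstract describes the mechanism only at a high level: estimate how heavy-tailed each module's ESD is, then scale that module's weight decay accordingly. The sketch below is a rough illustration of that idea in PyTorch, not the paper's exact formulation. The Hill-style tail-exponent estimate, the linear alpha-to-decay scaling around a base decay, and the helper names (hill_alpha, module_wise_decay) are assumptions made for this example; consult the authors' released code for AlphaDecay's actual procedure.

# Illustrative sketch: per-module weight decay scaled by a heavy-tail metric
# of the weight spectrum. Estimator and alpha->decay mapping are assumptions.
import torch

def hill_alpha(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill-style estimate of the power-law tail exponent of the ESD of W^T W."""
    w = weight.detach().float()
    # Eigenvalues of the correlation matrix are the squared singular values of W.
    eigs = torch.linalg.svdvals(w) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * eigs.numel()))
    tail = eigs[:k]
    # Hill estimator over the k largest eigenvalues (illustrative choice).
    alpha = 1.0 + k / torch.sum(torch.log(tail / tail[-1])).clamp_min(1e-12)
    return alpha.item()

def module_wise_decay(model: torch.nn.Module, base_decay: float = 0.1):
    """Build per-parameter optimizer groups: heavier-tailed modules (smaller
    alpha) get weaker decay, lighter-tailed modules (larger alpha) get stronger."""
    matrices = [(n, p) for n, p in model.named_parameters() if p.dim() >= 2]
    alphas = {n: hill_alpha(p) for n, p in matrices}
    mean_alpha = sum(alphas.values()) / len(alphas)
    groups = []
    for n, p in matrices:
        # Scale decay in proportion to alpha relative to the mean (assumption).
        groups.append({"params": [p], "weight_decay": base_decay * alphas[n] / mean_alpha})
    # 1-D parameters (biases, norms) conventionally receive no decay.
    others = [p for _, p in model.named_parameters() if p.dim() < 2]
    if others:
        groups.append({"params": others, "weight_decay": 0.0})
    return groups

# Hypothetical usage: an AdamW optimizer with module-wise decay on a toy model.
model = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64))
optimizer = torch.optim.AdamW(module_wise_decay(model, base_decay=0.1), lr=1e-3)

In practice such spectral estimates would likely be refreshed periodically during training rather than computed once at initialization, but the refresh schedule and the exact alpha metric are the authors' design choices and should be taken from their code, not from this sketch.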

View on arXiv
@article{he2025_2506.14562,
  title={AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs},
  author={Di He and Ajay Jaiswal and Songjun Tu and Li Shen and Ganzhao Yuan and Shiwei Liu and Lu Yin},
  journal={arXiv preprint arXiv:2506.14562},
  year={2025}
}