Where Do Large Learning Rates Lead Us? Neural Information Processing Systems (NeurIPS), 2024.
A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. Neural Information Processing Systems (NeurIPS), 2023.
Relaxed Attention for Transformer Models. IEEE International Joint Conference on Neural Networks (IJCNN), 2022.
Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. Neural Information Processing Systems (NeurIPS), 2022.
Understanding Decoupled and Early Weight Decay. AAAI Conference on Artificial Intelligence (AAAI), 2021.