Where Do Large Learning Rates Lead Us? Neural Information Processing Systems (NeurIPS), 2024.
A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models. Neural Information Processing Systems (NeurIPS), 2023.
Relaxed Attention for Transformer Models. IEEE International Joint Conference on Neural Networks (IJCNN), 2022.
Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. Neural Information Processing Systems (NeurIPS), 2022.
Understanding Decoupled and Early Weight Decay. AAAI Conference on Artificial Intelligence (AAAI), 2021.