Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs. International Conference on Learning Representations (ICLR), 2025.
A Tale of Tails: Model Collapse as a Change of Scaling Laws. International Conference on Machine Learning (ICML), 2024.
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. International Conference on Machine Learning (ICML), 2024.
Small-scale proxies for large-scale Transformer training instabilities. International Conference on Learning Representations (ICLR), 2024.
Scaling Laws for Sparsely-Connected Foundation Models. International Conference on Learning Representations (ICLR), 2024.
Scaling Data-Constrained Language Models. Neural Information Processing Systems (NeurIPS), 2023.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design. Neural Information Processing Systems (NeurIPS), 2023.
The Quantization Model of Neural Scaling. Neural Information Processing Systems (NeurIPS), 2023.
Language Is Not All You Need: Aligning Perception with Language Models. Neural Information Processing Systems (NeurIPS), 2023.
Scaling Laws for Multilingual Neural Machine Translation. International Conference on Machine Learning (ICML), 2023.
Scaling Vision Transformers to 22 Billion Parameters. International Conference on Machine Learning (ICML), 2023.
Scaling Laws for Generative Mixed-Modal Language Models. International Conference on Machine Learning (ICML), 2023.
Reproducible scaling laws for contrastive language-image learning. Computer Vision and Pattern Recognition (CVPR), 2023.
Broken Neural Scaling Laws. International Conference on Learning Representations (ICLR), 2023.
Scaling Laws for Reward Model Overoptimization. International Conference on Machine Learning (ICML), 2023.
Scaling Laws for a Multi-Agent Reinforcement Learning Model. International Conference on Learning Representations (ICLR), 2023.
Understanding Decoupled and Early Weight Decay. AAAI Conference on Artificial Intelligence (AAAI), 2021.
Language Models are Few-Shot Learners. Neural Information Processing Systems (NeurIPS), 2020.
Attention Is All You Need. Neural Information Processing Systems (NeurIPS), 2017.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. International Conference on Learning Representations (ICLR), 2017.