Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers (arXiv:2205.14336, 28 May 2022)
R. Liu, Young Jin Kim, Alexandre Muzio, Hany Awadalla · MoE
Papers citing "Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers" (19 papers)
Importance Sampling via Score-based Generative Models (07 Feb 2025)
Heasung Kim, Taekyun Lee, Hyeji Kim, Gustavo de Veciana · MedIm, DiffM

Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation (23 Nov 2024)
Fahao Chen, Peng Li, Zicong Hong, Zhou Su, Song Guo · MoMe, MoE

Exploring the Benefit of Activation Sparsity in Pre-training (04 Oct 2024)
Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou · MoE

Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules (30 Jun 2024)
Xinglin Pan, Wenxiang Lin, S. Shi, Xiaowen Chu, Weinong Sun, Bo Li · MoE

Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture (12 Mar 2024)
Fei Wang, Dan Guo, Kun Li, Zhun Zhong, Mengqing Wang

Model Compression and Efficient Inference for Large Language Models: A Survey (15 Feb 2024)
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He · MQ

Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (03 Oct 2023)
Young Jin Kim, Raffy Fahim, Hany Awadalla · MQ, MoE

Task-Based MoE for Multitask Multilingual Machine Translation (30 Aug 2023)
Hai Pham, Young Jin Kim, Subhabrata Mukherjee, David P. Woodruff, Barnabás Póczós, Hany Awadalla · MoE

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs (16 Aug 2023)
Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Awadalla · MQ

Soft Merging of Experts with Adaptive Routing (06 Jun 2023)
Mohammed Muqeeth, Haokun Liu, Colin Raffel · MoMe, MoE

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity (03 May 2023)
Da Xu, Maha Elbayad, Kenton W. Murray, Jean Maillard, Vedanuj Goswami · MoE

Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation (15 Dec 2022)
Maha Elbayad, Anna Y. Sun, Shruti Bhosale · MoE

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production (18 Nov 2022)
Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Awadalla · MoE

AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation (14 Oct 2022)
Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, L. Lakshmanan, Ahmed Hassan Awadallah, Sébastien Bubeck, Jianfeng Gao · MoE

MoEC: Mixture of Expert Clusters (19 Jul 2022)
Yuan Xie, Shaohan Huang, Tianyu Chen, Furu Wei · MoE

Transformer with Memory Replay (19 May 2022)
R. Liu, Barzan Mozafari · OffRL

Scalable and Efficient MoE Training for Multitask Multilingual Models (22 Sep 2021)
Young Jin Kim, A. A. Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Awadalla · MoE

Dropout: Explicit Forms and Capacity Control (06 Mar 2020)
R. Arora, Peter L. Bartlett, Poorya Mianjy, Nathan Srebro

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (06 Jun 2015)
Y. Gal, Zoubin Ghahramani · UQCV, BDL