Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
arXiv:2401.08383 · 16 January 2024
Jinghan Yao, Quentin G. Anthony, A. Shafi, Hari Subramoni, Dhabaleswar K. Panda
MoE

Papers citing "Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference" (11 of 11 papers shown)

Accelerating MoE Model Inference with Expert Sharding
Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, M. Vos, Milos Vujasinovic
MoE · 0 citations · 11 Mar 2025

A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications
Siyuan Mu, Sen Lin
MoE · 1 citation · 10 Mar 2025

Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling
Yan Li, Pengfei Zheng, Shuang Chen, Zewei Xu, Yuanhao Lai, Yunfei Du, Z. Wang
MoE · 0 citations · 06 Mar 2025

CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo
0 citations · 04 Mar 2025

MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
Seokjin Go, Divya Mahajan
MoE · 0 citations · 10 Feb 2025

LLM Inference Serving: Survey of Recent Advances and Opportunities
Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari
18 citations · 17 Jul 2024

Tutel: Adaptive Mixture-of-Experts at Scale
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, ..., Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Y. Xiong
MoE · 109 citations · 07 Jun 2022

Mixture-of-Experts with Expert Choice Routing
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon
MoE · 327 citations · 18 Feb 2022

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste
MQ · 684 citations · 31 Jan 2021

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
AIMat · 1,986 citations · 31 Dec 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE · 1,817 citations · 17 Sep 2019