Cited By

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
arXiv: 2401.14361
25 January 2024
Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
Topics: MoE

Papers citing "MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache" (9 of 9 papers shown)

FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Yuxin Zhou, Zheng Li, J. Zhang, Jue Wang, Y. Wang, Zhongle Xie, Ke Chen, Lidan Shou
Topics: MoE
09 May 2025

eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer
Topics: MoE
10 Mar 2025

iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
Topics: VLM
08 Jan 2025

DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
Yujie Zhang, Shivam Aggarwal, T. Mitra
Topics: MoE
16 Dec 2024

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
10 Feb 2024

Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness
Young Jin Kim, Raffy Fahim, Hany Awadalla
Topics: MQ, MoE
03 Oct 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023

ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He
Topics: MoE
18 Jan 2021

Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider
Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, G. Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, M. Russinovich, Ricardo Bianchini
06 Mar 2020