Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
arXiv:2402.07033 · 10 February 2024
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci

Papers citing "Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models" (18 of 18 papers shown)

FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Yuxin Zhou, Zheng Li, J. Zhang, Jue Wang, Y. Wang, Zhongle Xie, Ke Chen, Lidan Shou
MoE · 09 May 2025

D²MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
Haodong Wang, Qihua Zhou, Zicong Hong, Song Guo
MoE · 17 Apr 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
MoE · 12 Apr 2025

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
Shuzhang Zhong, Y. Sun, Ling Liang, Runsheng Wang, R. Huang, Meng Li
MoE · 08 Apr 2025

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai
MoE · 12 Mar 2025

Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, X. Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
LRM · 09 Mar 2025

CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo
04 Mar 2025

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
AI4CE · 18 Dec 2024

DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
Yujie Zhang, Shivam Aggarwal, T. Mitra
MoE · 16 Dec 2024

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica
25 Nov 2024

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
Peng Tang, Jiacheng Liu, X. Hou, Yifei Pu, Jing Wang, Pheng-Ann Heng, C. Li, M. Guo
MoE · 03 Nov 2024

ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon
MoE · 23 Oct 2024

Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
HamidReza Imani, Abdolah Amirany, Tarek A. El-Ghazawi
MoE · 19 Jul 2024

LLM Inference Serving: Survey of Recent Advances and Opportunities
Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari
17 Jul 2024

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
BDL · 16 Dec 2023

LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh-Vahid, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, C. C. D. Mundo, Mohammad Rastegari, Mehrdad Farajtabar
12 Dec 2023

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, C. C. D. Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar
06 Oct 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023