Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

Costin-Andrei Oncescu
Qingyang Wu
Wai Tong Chung
Robert Wu
Bryan Gopal
Junxiong Wang
Tri Dao
Ben Athiwaratkun
Main: 9 pages · 11 figures · 10 tables · Bibliography: 2 pages · Appendix: 7 pages
Abstract

An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, where the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even for moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use a batch-aware routing that lets tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of 16. Without any statistically significant loss in accuracy, our approach achieves latency reductions of 39% and 15% in the MoE layer decode latency, respectively.
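To make the batch-aware idea concrete, below is a minimal, hypothetical sketch (not the paper's exact algorithm) of how such re-routing could work at a single decode step: each token keeps its highest-scoring experts, and its remaining expert slots are biased toward experts that some other token in the batch already requires, so fewer distinct experts need to be loaded from memory. The function name, the `keep_k` parameter, and the bias scheme are illustrative assumptions.

```python
import torch


def batch_aware_rerouting(router_logits: torch.Tensor, top_k: int, keep_k: int) -> torch.Tensor:
    """Hypothetical sketch of batch-aware expert re-routing.

    router_logits: (batch, num_experts) gating scores for one decode step.
    top_k: experts each token ultimately activates.
    keep_k: experts per token that are kept exactly as the original router chose.
    Returns a (batch, top_k) tensor of expert indices.
    """
    batch, num_experts = router_logits.shape

    # 1) Every token keeps its most important ("crucial") experts unchanged.
    crucial = router_logits.topk(keep_k, dim=-1).indices            # (batch, keep_k)
    loaded = torch.zeros(num_experts, dtype=torch.bool)
    loaded[crucial.flatten()] = True                                 # experts the batch must load anyway

    # 2) For the remaining slots, boost experts that are already loaded so
    #    tokens "piggyback" on them instead of pulling in new experts.
    biased = router_logits + loaded.float() * router_logits.abs().max()
    biased = biased.scatter(-1, crucial, float("-inf"))              # don't re-pick crucial experts
    extra = biased.topk(top_k - keep_k, dim=-1).indices              # (batch, top_k - keep_k)

    return torch.cat([crucial, extra], dim=-1)                       # (batch, top_k)
```

Under this sketch, the union of selected experts across the batch shrinks relative to vanilla top-k routing, which is the quantity that drives memory-bound MoE decode latency.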
