Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

Xinhao Cheng
Zhihao Zhang
Yu Zhou
Jianan Ji
Jinchen Jiang
Zepeng Zhao
Ziruo Xiao
Zihao Ye
Yingyi Huang
Ruihang Lai
Hongyi Jin
Bohan Hou
Mengdi Wu
Yixin Dong
Anthony Yip
Songting Wang
Wenqin Yang
Xupeng Miao
Tianqi Chen
Zhihao Jia
Main: 12 pages, 16 figures, 1 table; bibliography: 3 pages
Abstract

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single megakernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, reducing end-to-end inference latency by up to 1.7x and pushing LLM inference performance close to hardware limits. MPK is publicly available at this https URL.
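The abstract's core idea of executing an SM-level task graph with decentralized scheduling can be illustrated with a small sketch. The task names and graph below are hypothetical (they do not come from the paper), and the single-threaded worklist stands in for idle SMs pulling ready tasks inside a persistent kernel; it is a conceptual model, not MPK's actual runtime.

```python
from collections import deque

# Hypothetical SM-level task graph: task -> successor tasks.
# Each task would run on one SM; a task becomes ready once all of its
# predecessors finish, allowing fine-grained overlap across operators
# (e.g. partial matmuls feeding an allreduce).
successors = {
    "matmul_part0": ["allreduce"],
    "matmul_part1": ["allreduce"],
    "allreduce": ["layernorm"],
    "layernorm": [],
}

# Per-task count of unfinished predecessors (the dependency counter
# an in-kernel runtime might decrement on task completion).
indegree = {t: 0 for t in successors}
for succs in successors.values():
    for s in succs:
        indegree[s] += 1

def run_task_graph(successors, indegree):
    """Execute tasks in dependency order: any idle worker ("SM")
    grabs the next ready task; finishing a task signals successors."""
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()   # an idle SM picks up a ready task
        order.append(task)       # "execute" the task
        for s in successors[task]:
            indegree[s] -= 1     # decrement successor's counter
            if indegree[s] == 0:
                ready.append(s)  # successor is now ready to run
    return order

print(run_task_graph(successors, dict(indegree)))
```

In a real megakernel this loop would run concurrently on every SM against shared atomic counters; the sketch only shows why no host-side kernel launches are needed between dependent tasks.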
