PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel

Jinjun Yi
Zhixin Zhao
Yitao Hu
Ke Yan
Weiwei Sun
Hao Wang
Laiping Zhao
Yuhao Zhang
Wenxin Li
Keqiu Li
Main: 13 pages · 16 figures · 2 tables · Bibliography: 3 pages
Abstract

LLM serving is increasingly dominated by decode attention, a memory-bound operation that must load massive KV caches from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical prefix sharing across requests (e.g., system prompts, tool/template definitions, RAG contexts). Existing attention implementations fail to fully exploit this sharing: *one-query-per-CTA* execution repeatedly loads the shared prefix's KV cache, while *one-size-fits-all* tiling leaves on-chip resources idle and exacerbates pipeline bubbles when KV lengths are uneven. Together, these choices amplify memory-bandwidth pressure and stall the already memory-bound decode attention.
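The bandwidth argument above can be illustrated with a back-of-envelope estimate of KV-cache bytes read per decode step, comparing per-request prefix reloading against a single shared load. All model dimensions, batch size, and prefix/suffix lengths below are illustrative assumptions, not values from the paper:

```python
# Rough estimate of KV-cache global-memory traffic per decode step.
# Model shape and workload numbers are hypothetical, for illustration only.

def kv_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of K+V cache read for `tokens` positions (factor 2 = K and V)."""
    return tokens * n_layers * n_kv_heads * head_dim * dtype_bytes * 2

batch = 16            # concurrent decode requests
shared_prefix = 2048  # tokens of shared system prompt / template
unique_suffix = 256   # per-request unique tokens

# One-query-per-CTA style: every request re-reads the shared prefix KV cache.
naive = batch * kv_bytes(shared_prefix + unique_suffix)

# Prefix-aware style: shared prefix KV is read once; only suffixes are per-request.
prefix_aware = kv_bytes(shared_prefix) + batch * kv_bytes(unique_suffix)

print(f"naive:        {naive / 1e9:.2f} GB per decode step")
print(f"prefix-aware: {prefix_aware / 1e9:.2f} GB per decode step")
print(f"reduction:    {naive / prefix_aware:.1f}x")
```

Under these assumed numbers, avoiding redundant prefix loads cuts KV traffic by roughly 6x, which is exactly the kind of saving a memory-bound kernel converts into latency reduction.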
