PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel

Jinjun Yi
Zhixin Zhao
Yitao Hu
Ke Yan
Weiwei Sun
Hao Wang
Laiping Zhao
Yuhao Zhang
Wenxin Li
Keqiu Li
Main: 13 pages · 16 figures · 2 tables · Bibliography: 3 pages
Abstract

LLM serving is increasingly dominated by decode attention, a memory-bound operation that must load massive KV caches from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical prefix sharing across requests (e.g., system prompts, tool/template definitions, RAG contexts). Existing attention implementations fail to fully exploit this sharing: *one-query-per-CTA* execution repeatedly loads the shared prefix's KV cache, while *one-size-fits-all* tiling leaves on-chip resources idle and exacerbates pipeline bubbles when KV lengths are uneven. Together, these choices amplify memory-bandwidth pressure and stall the already memory-bound decode attention.
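The bandwidth argument above can be illustrated with a back-of-envelope estimate of KV-cache bytes read per decode step, comparing per-request prefix reloading against a single shared load. All model dimensions, batch size, and prefix/suffix lengths below are illustrative assumptions, not values from the paper:

```python
# Rough estimate of KV-cache global-memory traffic per decode step.
# Model shape and workload numbers are hypothetical, for illustration only.

def kv_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of K+V cache read for `tokens` positions (factor 2 = K and V)."""
    return tokens * n_layers * n_kv_heads * head_dim * dtype_bytes * 2

batch = 16            # concurrent decode requests
shared_prefix = 2048  # tokens of shared system prompt / template
unique_suffix = 256   # per-request unique tokens

# One-query-per-CTA style: every request re-reads the shared prefix KV cache.
naive = batch * kv_bytes(shared_prefix + unique_suffix)

# Prefix-aware style: shared prefix KV is read once; only suffixes are per-request.
prefix_aware = kv_bytes(shared_prefix) + batch * kv_bytes(unique_suffix)

print(f"naive:        {naive / 1e9:.2f} GB per decode step")
print(f"prefix-aware: {prefix_aware / 1e9:.2f} GB per decode step")
print(f"reduction:    {naive / prefix_aware:.1f}x")
```

Under these assumed numbers, avoiding redundant prefix loads cuts KV traffic by roughly 6x, which is exactly the kind of saving a memory-bound kernel converts into latency reduction.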
