NOSA: Native and Offloadable Sparse Attention

15 October 2025

Zhou Su

Ao Sun

Hongya Lyu

Hengyu Zhao

Yudong Wang

Chaojun Xiao

Xu Han

Zhiyuan Liu

Main: 13 Pages

10 Figures

Bibliography: 6 Pages

10 Tables

Appendix: 9 Pages

Abstract

Larger inference batches improve decoding throughput, but batch size is limited by GPU memory, most of which is consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality because the sparse patterns used at inference do not match those seen in training. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase the throughput gains. To address both problems, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1B, 3B, and 8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at this https URL.
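The abstract does not spell out how NOSA bounds CPU-GPU traffic, so the following is only an illustrative sketch of the general idea of capping transfers in block-sparse attention, not the paper's method. It assumes the KV cache is split into blocks, that some blocks are already resident on the GPU, and that each decoding step may fetch at most a fixed number of new blocks from the CPU. The function name `select_blocks` and its parameters (`scores`, `k`, `resident`, `max_fetch`) are hypothetical.

```python
def select_blocks(scores, k, resident, max_fetch):
    """Pick up to k KV blocks by score, fetching at most max_fetch
    blocks that are not already resident on the GPU.

    Hypothetical illustration only; not NOSA's actual selection rule.
    """
    # Visit blocks from highest to lowest relevance score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, fetched = [], []
    for i in ranked:
        if len(chosen) == k:
            break
        if i in resident:
            # Already on the GPU: selecting it costs no transfer.
            chosen.append(i)
        elif len(fetched) < max_fetch:
            # Not resident: selecting it spends one unit of transfer budget.
            chosen.append(i)
            fetched.append(i)
        # Otherwise the budget is exhausted and the block is skipped.
    return sorted(chosen), sorted(fetched)


scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6]
resident = {1, 4}

# Unconstrained selection takes the top-3 blocks and fetches all of them.
free_choice, free_fetch = select_blocks(scores, k=3, resident=resident, max_fetch=3)

# With a budget of one fetch, the rest of the selection falls back to
# blocks that are already on the GPU.
capped_choice, capped_fetch = select_blocks(scores, k=3, resident=resident, max_fetch=1)
```

In this toy setting the unconstrained call transfers three blocks per step, while the capped call transfers one and reuses two resident blocks. A training-free method applying such a cap would change the attention pattern the model was trained on, which is the mismatch the abstract attributes to prior offloading approaches; training with the constraint in place is the motivation the abstract gives for NOSA.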
