
SWAA: Sliding Window Attention Adaptation for Efficient and Quality-Preserving Long-Context Processing

Yijiong Yu
Jiale Liu
Qingyun Wu
Huazheng Wang
Ji Pei
Main: 8 pages · Appendix: 4 pages · Bibliography: 2 pages · 3 figures · 8 tables
Abstract

The quadratic complexity of self-attention in Transformer-based LLMs renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, it suffers from catastrophic long-context performance collapse, which stems from two fundamental factors: the training-inference mismatch when naively applying SWA to models pretrained with Full Attention (FA), and the inherent structural inability to access distant information when SWA is applied to every module at all times. To address these dual challenges, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapts FA models to SWA without costly pretraining. SWAA systematically combines four core strategies to tackle these distinct issues: (1) FA decode and (2) interleaving FA and SWA layers, which mitigate structural defects by selectively allowing access to distant information; alongside (3) preserving "sink" tokens and (4) lightweight fine-tuning, which mitigate the training-inference mismatch. Our experiments reveal that while isolated strategies are insufficient, specific synergistic combinations effectively recover long-context performance. Despite varying computational overheads, our performance-efficiency trade-off analysis identifies optimal SWAA configurations for diverse scenarios, achieving 30% to 100% speedups for long-context inference with acceptable quality retention. Our code, data and model weights are available at this https URL
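To make the masking strategies concrete, here is a minimal, illustrative sketch (not the authors' released implementation) of how SWA masks with preserved sink tokens and interleaved FA layers could be constructed. The window size, sink-token count, and interleave period are hypothetical placeholder values.

```python
# Illustrative sketch of SWAA-style attention masks, assuming PyTorch.
import torch

def swa_mask(seq_len: int, window: int, num_sinks: int = 0) -> torch.Tensor:
    """Causal sliding-window mask; the first `num_sinks` ("sink") tokens
    stay visible to every query, as in strategy (3)."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, L)
    causal = k <= q                          # no attention to future tokens
    in_window = (q - k) < window             # local window of size `window`
    is_sink = k < num_sinks                  # sink tokens always attended
    return causal & (in_window | is_sink)

def layer_mask(layer_idx: int, seq_len: int, window: int,
               num_sinks: int, fa_every: int) -> torch.Tensor:
    """Strategy (2): interleave FA and SWA layers. Every `fa_every`-th
    layer keeps a plain causal (Full Attention) mask; the rest use SWA."""
    if layer_idx % fa_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
    return swa_mask(seq_len, window, num_sinks)

# Example: layer 0 uses Full Attention; layer 1 uses SWA with one sink token.
print(layer_mask(0, 8, window=3, num_sinks=1, fa_every=4).int())
print(layer_mask(1, 8, window=3, num_sinks=1, fa_every=4).int())
```

Strategy (1), FA decode, would correspond to using the plain causal mask for all layers during the decode phase while keeping SWA for prefill; strategy (4) is lightweight fine-tuning under these masks rather than a masking change itself.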
