
SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Main: 7 pages, 5 figures, 4 tables; Bibliography: 2 pages
Abstract

While Transformer architectures excel at modeling long-range dependencies, contributing to their widespread adoption in vision tasks, the quadratic complexity of softmax-based attention imposes a major bottleneck, particularly when processing high-resolution images. Linear attention offers a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression yields low-rank $KV$ feature maps, contributing to a performance gap relative to softmax attention. To mitigate this limitation, we propose \textbf{S}elective \textbf{A}daptive \textbf{GA}ting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates that selectively modulate how information is aggregated into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. In addition, we propose an efficient Hadamard-product decomposition for gate computation that introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4\% on ImageNet, demonstrating both computational efficiency and model effectiveness.
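To make the $(QK)V$ versus $Q(KV)$ reformulation and the gating idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: the feature map `phi`, the sigmoid gate derived from the keys, and the projection `Wg` are illustrative assumptions standing in for SAGA's actual gate design.

```python
# Minimal sketch: softmax attention (quadratic in N) vs. linear attention
# computed as Q(KV), plus a hypothetical per-token gate on the KV aggregation
# in the spirit of SAGA. Shapes, phi, and the gate are illustrative only.
import numpy as np

def softmax_attention(Q, K, V):
    # (QK^T)V: forming the N x N score matrix costs O(N^2 d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Q(K^T V): associativity lets us build the d x d state K^T V first,
    # so the cost is O(N d^2), i.e. linear in sequence length N.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                        # (d, d) aggregated KV state
    z = Kp.sum(axis=0)                   # normalizer
    return (Qp @ kv) / (Qp @ z)[:, None]

def gated_linear_attention(Q, K, V, Wg, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Hypothetical input-adaptive gate g_i in (0, 1) per token that modulates
    # how strongly each key-value pair enters the KV state (assumption, not
    # the paper's exact Hadamard-product decomposition).
    Qp, Kp = phi(Q), phi(K)
    g = 1.0 / (1.0 + np.exp(-(K @ Wg)))  # (N, 1) sigmoid gate from the keys
    kv = (g * Kp).T @ V                  # selectively weighted aggregation
    z = (g * Kp).sum(axis=0)
    return (Qp @ kv) / (Qp @ z)[:, None]

N, d = 196, 64                           # e.g. 14 x 14 tokens, 64-dim heads
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
Wg = rng.standard_normal((d, 1)) * 0.1
print(softmax_attention(Q, K, V).shape,
      linear_attention(Q, K, V).shape,
      gated_linear_attention(Q, K, V, Wg).shape)   # all (196, 64)
```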
