SystolicAttention: Fusing FlashAttention within a Single Systolic Array
Comments: Main: 11 pages, 15 figures, 4 tables; bibliography: 4 pages
Abstract
Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention: systolic arrays achieve high utilization only for large, consecutive matrix multiplications, whereas FlashAttention requires frequent interleaving of matrix multiplications and softmax operations.
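To make the interleaving concrete, the following is a minimal NumPy sketch of a FlashAttention-style forward pass (not the paper's implementation; the function name, tile size, and shapes are illustrative assumptions). Each key/value tile forces a matmul for the attention logits, an online-softmax rescaling step, and a second matmul for the value accumulation, which is the fine-grained alternation that keeps a systolic array from running one long, uninterrupted matrix multiplication.

```python
import numpy as np


def flash_attention(Q, K, V, tile=64):
    """Tiled scaled dot-product attention with an online softmax.

    Hypothetical sketch: each K/V tile triggers matmul #1 (Q @ K_tile^T),
    a softmax rescaling of the running statistics, and matmul #2
    (P_tile @ V_tile), i.e. the interleaving described in the abstract.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)            # running (unnormalized) output
    m = np.full(n, -np.inf)         # running row-wise max of the logits
    l = np.zeros(n)                 # running row-wise sum of exp(logits - m)

    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]

        # Matrix multiplication #1: attention logits for this tile.
        S = (Q @ Kj.T) * scale

        # Softmax bookkeeping: update the max and rescale prior partials.
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=1)

        # Matrix multiplication #2: accumulate the weighted values.
        O = alpha[:, None] * O + P @ Vj
        m = m_new

    return O / l[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    out = flash_attention(Q, K, V)

    # Reference: ordinary (untiled) softmax attention for comparison.
    S = (Q @ K.T) / np.sqrt(64)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (P / P.sum(axis=1, keepdims=True)) @ V
    print(np.allclose(out, ref, atol=1e-6))
```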
