SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Main: 11 pages
15 figures
4 tables
Bibliography: 4 pages
Abstract

Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention: systolic arrays achieve high utilization only on large, consecutive matrix multiplications, whereas FlashAttention requires frequently interleaving matrix multiplications with softmax operations.
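To illustrate the interleaving the abstract refers to, here is a minimal NumPy sketch of a FlashAttention-style tiled loop with online softmax. It is a simplified illustration, not the paper's method: the tile size, function name, and single-pass rescaling scheme are assumptions, and real accelerators operate on hardware tiles rather than NumPy arrays. Note how every matrix-multiply step (`Q @ Kj.T`, `P @ Vj`) is separated by softmax arithmetic (max, exp, rescale), which is what breaks up the long consecutive matmuls that systolic arrays prefer.

```python
import numpy as np

def flash_attention(Q, K, V, tile=2):
    """Tiled attention with online softmax (FlashAttention-style sketch)."""
    n, d = Q.shape
    O = np.zeros((n, d))      # running (unnormalized) output
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, n, tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = Q @ Kj.T / np.sqrt(d)          # matmul: attention scores for this tile
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])     # softmax arithmetic interleaved here
        scale = np.exp(m - m_new)          # rescale previous partial results
        l = scale * l + P.sum(axis=1)
        O = scale[:, None] * O + P @ Vj    # matmul: accumulate weighted values
        m = m_new
    return O / l[:, None]                  # final softmax normalization
```

The final result matches a naive softmax-attention computation, but each tile iteration alternates between matmul and elementwise softmax work.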
