Fast Attention Requires Bounded Entries

26 February 2023
Josh Alman
Zhao Song
Abstract

In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A \mathbf{1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$.

  • If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error.
  • If $d = O(\log n)$ and $B = \Theta(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2-\Omega(1)}$.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
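To make the definition concrete, here is a minimal sketch of the straightforward $\Omega(n^2)$ computation described in the abstract, written with NumPy (the library choice and the function name `attention` are assumptions for illustration; the paper itself gives no code). It materializes the full $n \times n$ matrix $A$, which is exactly what the paper's $n^{1+o(1)}$-time algorithm avoids by using $A$ only implicitly.

```python
import numpy as np

def attention(Q, K, V):
    """Naive computation of Att(Q, K, V) = diag(A 1_n)^{-1} A V,
    where A = exp(Q K^T / d) entry-wise, following the abstract's definition.
    Explicitly forms the n x n matrix A, so it takes Omega(n^2) time."""
    n, d = Q.shape
    A = np.exp(Q @ K.T / d)             # n x n attention matrix
    row_sums = A @ np.ones(n)           # A 1_n: per-row normalizers
    return (A / row_sums[:, None]) @ V  # diag(A 1_n)^{-1} A V, an n x d matrix

# Example with entries bounded by B, matching the abstract's setting
n, d, B = 8, 4, 1.0
rng = np.random.default_rng(0)
Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 4)
```

This sketch is only the baseline that the paper's results are measured against: for $B = o(\sqrt{\log n})$ the attention output can be approximated in near-linear time without ever forming $A$, while for $B = \Theta(\sqrt{\log n})$ no truly subquadratic algorithm exists under SETH.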
