ResearchTrend.AI


arXiv:2311.14652 (v2, latest)

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

24 November 2023
Raghav Addanki
Chenyang Li
Zhao Song
Chiwun Yang
Abstract

Deploying Large Language Models (LLMs) in streaming applications that involve long contexts, particularly for extended dialogues and text analysis, is of paramount importance but presents two significant challenges. First, memory consumption is substantial during the decoding phase due to the caching of the Key and Value (KV) states of previous tokens. Second, attention computation is time-consuming, with a time complexity of O(n^2) for the generation of each token. At its recent DevDay (Nov 6, 2023), OpenAI released a new model that supports 128K-token documents; in this paper, we focus on the memory-efficiency issue when the context length n is much greater than 128K (n ≫ 2^d). Considering a single-layer self-attention with Query, Key, and Value matrices Q, K, V ∈ ℝ^{n×d}, the polynomial method approximates the attention output T ∈ ℝ^{n×d}. It accomplishes this by constructing U_1, U_2 ∈ ℝ^{n×t} to expedite the computation of Attn(Q, K, V) within n^{1+o(1)} time. Even so, storing the Key and Value matrices K, V ∈ ℝ^{n×d} still necessitates O(nd) space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that reads the data in a single streaming pass. This method employs sublinear o(n) space to store three sketch matrices, eliminating the need to store K and V exactly. Notably, our algorithm exhibits exceptional memory efficiency for super-long token sequences: as the token length n increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique property underscores the potential of our technique for efficiently handling LLMs in streaming applications.
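To make the sketching idea concrete, here is a minimal one-pass streaming sketch in Python. This is an illustrative toy, not the paper's actual construction (which stores three sketch matrices with stated error guarantees): the function name, the Gaussian Johnson–Lindenstrauss projection, and all parameter choices below are our own assumptions. Each incoming (key, value) pair is folded into fixed-size m×d sketches, so memory stays O(md) regardless of stream length n, while the product of the sketches approximates K^T V.

```python
import numpy as np

def stream_sketch_kv(tokens_k, tokens_v, m, seed=0):
    """One-pass streaming sketches of the Key and Value streams.

    Each incoming (k_i, v_i) pair is folded into fixed-size sketch
    matrices S_K, S_V in R^{m x d} via a shared random projection
    column phi_i with i.i.d. N(0, 1/m) entries, so memory is O(m*d),
    independent of the stream length n.
    """
    d = tokens_k.shape[1]
    rng = np.random.default_rng(seed)
    S_K = np.zeros((m, d))
    S_V = np.zeros((m, d))
    for k_i, v_i in zip(tokens_k, tokens_v):
        # Fresh JL column for this token; never stored after use.
        phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)
        S_K += np.outer(phi, k_i)  # accumulates Phi @ K one column at a time
        S_V += np.outer(phi, v_i)
    return S_K, S_V

# Since E[Phi^T Phi] = I_n, S_K^T @ S_V is an unbiased estimate of
# K^T V, recovered without ever storing K or V in full.
n, d, m = 256, 4, 4096
rng = np.random.default_rng(1)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
S_K, S_V = stream_sketch_kv(K, V, m)
approx = S_K.T @ S_V
exact = K.T @ V
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

The sketch memory is m×d per matrix, independent of n; the approximation error of this simple estimator shrinks as m grows (roughly as sqrt(n/m) in relative Frobenius norm for the setup above).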
