The Expressibility of Polynomial based Attention Scheme

30 October 2023
Zhao Song
Guangyi Xu
Junze Yin
arXiv:2310.20051 [PDF, HTML]
Abstract

Large language models (LLMs) have significantly improved various aspects of our daily lives. These models have impacted numerous domains, from healthcare to education, enhancing productivity, decision-making, and accessibility, and have to some extent reshaped how people live and work. However, the quadratic complexity of attention in transformer architectures poses a challenge when scaling these models to long textual contexts, making it impractical to train very large models on lengthy texts or to run them efficiently at inference time. A recent study by [KMZ23] introduced a technique that replaces the softmax with a polynomial function, combined with polynomial sketching, to speed up attention mechanisms, but the theoretical properties of this new approach are not yet well understood. In this paper, we offer a theoretical analysis of the expressive capabilities of polynomial attention. Our study reveals a disparity between the abilities of high-degree and low-degree polynomial attention. Specifically, we construct two carefully designed datasets, $\mathcal{D}_0$ and $\mathcal{D}_1$, where $\mathcal{D}_1$ contains a feature with a significantly larger value than any feature in $\mathcal{D}_0$. We demonstrate that with a sufficiently high degree $\beta$, a single-layer polynomial attention network can distinguish between $\mathcal{D}_0$ and $\mathcal{D}_1$, whereas with a low degree $\beta$ the network cannot effectively separate the two datasets. This analysis underscores the greater effectiveness of high-degree polynomials in amplifying large values and distinguishing between datasets. Our results offer insight into the representational capacity of polynomial attention and provide a rationale for incorporating higher-degree polynomials in attention mechanisms to capture intricate linguistic correlations.
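To make the role of the degree $\beta$ concrete, the sketch below implements a single-layer polynomial attention unit in NumPy, with the softmax replaced by an entrywise degree-$\beta$ polynomial followed by row normalization, and then demonstrates the amplification effect on a toy score vector. The function name polynomial_attention, the normalization choice, and the toy numbers are illustrative assumptions for this sketch; they are not the exact construction of [KMZ23] or the paper's datasets $\mathcal{D}_0$ and $\mathcal{D}_1$.

```python
import numpy as np

def polynomial_attention(Q, K, V, beta):
    """Single-layer polynomial attention (illustrative sketch).

    Replaces softmax(Q K^T) with an entrywise degree-`beta` polynomial:
    weights proportional to (Q K^T) ** beta, row-normalized. Assumes an
    even degree or non-negative scores so the weights stay non-negative.
    """
    scores = Q @ K.T                      # (n, n) query-key inner products
    powered = scores ** beta              # entrywise degree-beta polynomial
    weights = powered / (powered.sum(axis=1, keepdims=True) + 1e-9)
    return weights @ V

# Toy illustration of the separation phenomenon: one score is larger than
# the rest (playing the role of the large feature in D_1). A high degree
# concentrates essentially all attention weight on it; a low degree leaves
# the weights close to uniform, so the two cases are hard to tell apart.
scores = np.array([1.0, 1.0, 1.0, 3.0])
for beta in (1, 2, 8):
    w = scores ** beta
    w /= w.sum()
    print(beta, np.round(w, 3))
# beta = 1 -> [0.167 0.167 0.167 0.5  ]
# beta = 2 -> [0.083 0.083 0.083 0.75 ]
# beta = 8 -> [0.    0.    0.    1.   ]
```

The same mechanism drives the separation result stated in the abstract: raising scores to a high power amplifies the gap created by a single large feature, which is what allows a high-degree single-layer network to distinguish $\mathcal{D}_1$ from $\mathcal{D}_0$ while a low-degree one cannot.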
