Provably learning a multi-head attention layer

6 February 2024
Sitan Chen
Yuanzhi Li
Abstract

The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem:

- Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
- We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.

We focus on Boolean $\mathbf{X}$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feedforward networks, which predominantly exploit algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof.
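
To make the layer definition concrete, below is a minimal NumPy sketch of $F(\mathbf{X}) = \sum_{i=1}^m \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$ as stated in the abstract. The function and variable names (`multi_head_attention`, `Thetas`, `Ws`) and the random parameter scaling are illustrative assumptions, not part of the paper; the Boolean $\{\pm 1\}^{k\times d}$ inputs mirror the learning setup described above.

```python
import numpy as np

def softmax(A):
    # Row-wise softmax over each k x k matrix of attention scores.
    A = A - A.max(axis=-1, keepdims=True)  # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Thetas, Ws):
    """Compute F(X) = sum_i softmax(X Theta_i X^T) X W_i.

    X:      (k, d) sequence of k tokens of dimension d
    Thetas: list of m attention matrices, each (d, d)
    Ws:     list of m projection matrices, each (d, d)
    """
    k, d = X.shape
    F = np.zeros((k, d))
    for Theta, W in zip(Thetas, Ws):
        scores = X @ Theta @ X.T   # (k, k) attention scores
        attn = softmax(scores)     # row-wise softmax
        F += attn @ X @ W          # (k, d) contribution of this head
    return F

# Example: random Boolean inputs X in {+1, -1}^{k x d}, as in the learning setup.
rng = np.random.default_rng(0)
k, d, m = 5, 8, 3
X = rng.choice([-1.0, 1.0], size=(k, d))
Thetas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
print(multi_head_attention(X, Thetas, Ws).shape)  # (5, 8)
```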
