Large language models (LLMs) have played a pivotal role in revolutionizing
various facets of our daily existence. Solving attention regression is a
fundamental task in optimizing LLMs. In this work, we focus on giving a
provable guarantee for the one-layer attention network objective function
$$L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \Big( \big\langle \langle \exp(A_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(A_{j_0} x), \; A_3 Y_{*,i_0} \big\rangle - b_{j_0,i_0} \Big)^2 .$$
Here $A \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and
$A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $A_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of
$A$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to
learn; $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the
entry of $B$ at the $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^{d}$
is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the
vectorization of $X$.
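To make the objective concrete, the following is a minimal NumPy sketch (not from the paper) that evaluates $L(X,Y)$ literally from the definition above; the sizes $n, d$, the random matrices, and the row-major vectorization of $X$ are all illustrative assumptions.

```python
import numpy as np

# Illustrative sizes and data; the paper's n, d and matrices are generic.
n, d = 8, 4
rng = np.random.default_rng(0)
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))
A3 = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))

def attention_loss(X, Y):
    """Evaluate L(X, Y) exactly as in the displayed objective."""
    A = np.kron(A1, A2)                    # A in R^{n^2 x d^2}
    x = X.flatten()                        # x = vec(X), row-major (assumed convention)
    loss = 0.0
    for j0 in range(n):
        A_j0 = A[j0 * n:(j0 + 1) * n, :]   # j0-th n x d^2 block of A
        u = np.exp(A_j0 @ x)               # exp(A_{j0} x) in R^n
        f = u / u.sum()                    # <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x)
        for i0 in range(d):
            pred = f @ (A3 @ Y[:, i0])     # <f, A3 Y_{*,i0}>
            loss += (pred - B[j0, i0]) ** 2
    return loss

print(attention_loss(X, Y))
```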
In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can
be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of that layer. The matrix version of $x$ can
be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative
greedy algorithm to train the loss function $L(X,Y)$ up to accuracy $\epsilon$ that runs in
$O\big( ( \mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega} ) \log(1/\epsilon) \big)$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix
by another $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of
matrix multiplication.
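As a sanity check on this reading (our own illustration, not a result from the paper), the blockwise Kronecker objective above coincides with the familiar one-layer softmax-attention form $\| D^{-1} \exp(A_1 X A_2^\top) A_3 Y - B \|_F^2$, with $D = \mathrm{diag}(\exp(A_1 X A_2^\top) \mathbf{1}_n)$, provided $x$ is taken to be the row-major vectorization of $X$ (an assumed convention). The sketch below verifies this numerically on random data; all names and sizes are illustrative.

```python
import numpy as np

n, d = 8, 4
rng = np.random.default_rng(1)
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))
A3 = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
X = rng.standard_normal((d, d))   # plays the role of QK^T in the abstract's reading
Y = rng.standard_normal((d, d))   # plays the role of V

def loss_kron(X, Y):
    """Blockwise objective built from A = A1 (Kronecker) A2, as in the abstract."""
    A = np.kron(A1, A2)
    x = X.flatten()                              # row-major vec(X) -- assumed convention
    total = 0.0
    for j0 in range(n):
        u = np.exp(A[j0 * n:(j0 + 1) * n, :] @ x)
        f = u / u.sum()
        total += np.sum((f @ (A3 @ Y) - B[j0, :]) ** 2)
    return total

def loss_attention(X, Y):
    """One-layer softmax-attention form: || D^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2."""
    S = np.exp(A1 @ X @ A2.T)                    # unnormalized attention scores
    P = S / S.sum(axis=1, keepdims=True)         # row-wise softmax, i.e. D^{-1} exp(.)
    return np.sum((P @ A3 @ Y - B) ** 2)

print(np.isclose(loss_kron(X, Y), loss_attention(X, Y)))  # True under the vec convention above
```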