
Exact Sequence Classification with Hardmax Transformers

Abstract

We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$. Specifically, given $N$ sequences of arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters that perfectly classifies this dataset. Our construction achieves the best complexity estimate to date, independent of the length of the sequences, by innovatively alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices within the attention mechanism, a common practice in real-life transformer implementations. Consequently, our analysis holds twofold significance: it substantially advances the mathematical theory of transformers, and it rigorously justifies their exceptional real-world performance in sequence classification tasks.
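For readers unfamiliar with the hardmax mechanism, the following minimal NumPy sketch illustrates one self-attention layer in which the usual softmax is replaced by a hardmax that places equal weight on the score-maximizing tokens (the zero-temperature limit of softmax). It is an illustration only, not the paper's construction: the function names, the tie-breaking rule, and the rank-1 value matrix are assumptions made here for concreteness.

```python
import numpy as np

def hardmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise hardmax: equal weight on the entries attaining the row
    maximum, zero elsewhere (limit of softmax(scores / T) as T -> 0)."""
    is_max = scores == scores.max(axis=-1, keepdims=True)
    return is_max / is_max.sum(axis=-1, keepdims=True)

def hardmax_self_attention(X, Q, K, V):
    """Single-head self-attention with hardmax in place of softmax.
    X: (n_tokens, d) sequence; Q, K, V: (possibly low-rank) d x d matrices."""
    scores = (X @ Q) @ (X @ K).T   # pairwise attention scores
    A = hardmax(scores)            # each token attends only to its maximizers
    return A @ (X @ V)             # convex combination of projected tokens

# Toy usage: five tokens in R^2, with a rank-1 value matrix as a stand-in
# for the low-rank parameter matrices mentioned in the abstract.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
Q = np.eye(2)
K = np.eye(2)
V = np.outer([1.0, 0.0], [1.0, 0.0])  # rank 1
print(hardmax_self_attention(X, Q, K, V))
```

Because each row of the hardmax matrix averages only the tokens attaining the maximal score, repeated application tends to collapse tokens onto a small set of representatives; this is the clustering effect the construction exploits.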

@article{alcalde2025_2502.02270,
  title={Exact Sequence Classification with Hardmax Transformers},
  author={Albert Alcalde and Giovanni Fantuzzi and Enrique Zuazua},
  journal={arXiv preprint arXiv:2502.02270},
  year={2025}
}