
Memorization Capacity of Multi-Head Attention in Transformers

International Conference on Learning Representations (ICLR), 2024
Abstract

In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear independence assumption on the input data, we present a theoretical analysis demonstrating that an H-head attention layer with context size n, dimension d, and O(Hd^2) parameters can memorize O(Hn) examples. We conduct experiments that verify our assumptions on the image classification task using a Vision Transformer. To validate our theoretical findings, we perform synthetic experiments and show a linear relationship between memorization capacity and the number of attention heads.
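The sketch below illustrates the kind of synthetic memorization experiment the abstract describes: a single multi-head attention layer is trained to fit randomly labeled inputs while the head count H is swept. This is not the paper's code; the task (random binary labels), all hyperparameters, and the use of PyTorch's nn.MultiheadAttention, which splits the embedding dimension across heads rather than giving each head the full dimension d as in the paper's analysis, are assumptions made for illustration.

```python
# Minimal synthetic memorization sketch (illustrative, not the authors' code).
# Trains one multi-head attention layer on randomly labeled contexts and
# reports the fraction of examples fit, for several head counts H.
import torch
import torch.nn as nn

def memorized_fraction(num_heads: int, num_examples: int, d: int = 64,
                       ctx: int = 8, steps: int = 2000, lr: float = 1e-3) -> float:
    torch.manual_seed(0)
    # Random inputs: each example is a context of `ctx` tokens in R^d,
    # paired with a random binary label (pure memorization, no signal).
    X = torch.randn(num_examples, ctx, d)
    y = torch.randint(0, 2, (num_examples,)).float()

    # Assumption: nn.MultiheadAttention uses head dimension d/H, a
    # simplification relative to the paper's per-head dimension d.
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
    readout = nn.Linear(d, 1)  # scalar prediction from the first token's output
    params = list(attn.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(steps):
        out, _ = attn(X, X, X)  # self-attention over each context
        logits = readout(out[:, 0]).squeeze(-1)
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        out, _ = attn(X, X, X)
        preds = (readout(out[:, 0]).squeeze(-1) > 0).float()
        return (preds == y).float().mean().item()

# Sweep H while scaling the dataset size with it: under an O(Hn)-style
# bound, the memorized fraction should stay roughly constant.
for H in (1, 2, 4, 8):
    frac = memorized_fraction(num_heads=H, num_examples=200 * H)
    print(f"H={H}: memorized {frac:.2%} of {200 * H} examples")
```

Scaling the number of training examples linearly with H in the sweep mirrors the linear relationship between memorization capacity and head count that the paper reports; the specific constant (200 examples per head) is arbitrary.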
