Memorization Capacity of Multi-Head Attention in Transformers
International Conference on Learning Representations (ICLR), 2024
Abstract
In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear independence assumption on the input data, we present a theoretical analysis demonstrating that an $H$-head attention layer with context size $n$, dimension $d$, and $\Theta(Hd^2)$ parameters can memorize $\Omega(Hn)$ examples. We conduct experiments on an image classification task with a Vision Transformer to verify that our assumption holds in practice. To validate our theoretical findings, we perform synthetic experiments and observe a linear relationship between memorization capacity and the number of attention heads.
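As a rough illustration of the synthetic setup described in the abstract, the sketch below (ours, not the authors' code) trains a single multi-head attention layer plus a linear readout to fit random labels, and checks whether the number of examples that can be fit perfectly scales with the number of heads $H$. All hyperparameters, the binary-label setup, and the first-token readout are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a memorization experiment: fit T random examples
# (context size n, dimension d) with an H-head attention layer and
# report final training accuracy. Per the Omega(Hn) lower bound,
# doubling H should roughly double the number of fittable examples.
import torch
import torch.nn as nn

def memorization_accuracy(H, T, n=8, d=64, steps=2000, lr=1e-3, seed=0):
    torch.manual_seed(seed)
    X = torch.randn(T, n, d)                 # random input sequences
    y = torch.randint(0, 2, (T,))            # random binary labels to memorize
    attn = nn.MultiheadAttention(d, H, batch_first=True)  # d must be divisible by H
    readout = nn.Linear(d, 2)
    opt = torch.optim.Adam(list(attn.parameters()) + list(readout.parameters()), lr=lr)
    for _ in range(steps):
        out, _ = attn(X, X, X)               # self-attention over each sequence
        logits = readout(out[:, 0])          # classify from the first token
        loss = nn.functional.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        out, _ = attn(X, X, X)
        return (readout(out[:, 0]).argmax(-1) == y).float().mean().item()

# Scale the dataset size T proportionally with the head count H.
for H, T in [(1, 64), (2, 128), (4, 256)]:
    print(f"H={H:2d}, T={T:4d}: train acc = {memorization_accuracy(H, T):.3f}")
```

Under the paper's scaling, training accuracy should stay near 1.0 across these rows, since the example budget $T$ grows linearly with $H$; keeping $T$ fixed while shrinking $H$ would instead be expected to degrade fit once $T$ exceeds the layer's capacity.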
