
Memorization Capacity of Multi-Head Attention in Transformers

International Conference on Learning Representations (ICLR), 2024
Abstract

In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear independence assumption on the input data, we present a theoretical analysis demonstrating that an H-head attention layer with context size n, dimension d, and O(Hd^2) parameters can memorize O(Hn) examples. We conduct experiments that verify our assumptions on the image classification task using a Vision Transformer. To validate our theoretical findings, we perform synthetic experiments and show a linear relationship between memorization capacity and the number of attention heads.
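The sketch below illustrates the kind of synthetic memorization experiment the abstract describes: a single multi-head attention layer is trained to fit randomly labeled inputs while the head count H is swept. This is not the paper's code; the task (random binary labels), all hyperparameters, and the use of PyTorch's nn.MultiheadAttention, which splits the embedding dimension across heads rather than giving each head the full dimension d as in the paper's analysis, are assumptions made for illustration.

```python
# Minimal synthetic memorization sketch (illustrative, not the authors' code).
# Trains one multi-head attention layer on randomly labeled contexts and
# reports the fraction of examples fit, for several head counts H.
import torch
import torch.nn as nn

def memorized_fraction(num_heads: int, num_examples: int, d: int = 64,
                       ctx: int = 8, steps: int = 2000, lr: float = 1e-3) -> float:
    torch.manual_seed(0)
    # Random inputs: each example is a context of `ctx` tokens in R^d,
    # paired with a random binary label (pure memorization, no signal).
    X = torch.randn(num_examples, ctx, d)
    y = torch.randint(0, 2, (num_examples,)).float()

    # Assumption: nn.MultiheadAttention uses head dimension d/H, a
    # simplification relative to the paper's per-head dimension d.
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
    readout = nn.Linear(d, 1)  # scalar prediction from the first token's output
    params = list(attn.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    for _ in range(steps):
        out, _ = attn(X, X, X)  # self-attention over each context
        logits = readout(out[:, 0]).squeeze(-1)
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        out, _ = attn(X, X, X)
        preds = (readout(out[:, 0]).squeeze(-1) > 0).float()
        return (preds == y).float().mean().item()

# Sweep H while scaling the dataset size with it: under an O(Hn)-style
# bound, the memorized fraction should stay roughly constant.
for H in (1, 2, 4, 8):
    frac = memorized_fraction(num_heads=H, num_examples=200 * H)
    print(f"H={H}: memorized {frac:.2%} of {200 * H} examples")
```

Scaling the number of training examples linearly with H in the sweep mirrors the linear relationship between memorization capacity and head count that the paper reports; the specific constant (200 examples per head) is arbitrary.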
