Efficient Attention: Attention with Linear Complexities
The attention mechanism has seen wide application in computer vision and natural language processing. Recent works developed the dot-product attention mechanism and applied it to various vision and language tasks. However, the memory and computational costs of dot-product attention grow quadratically with the spatiotemporal size of the input. Such growth prohibits applying the mechanism to large inputs, e.g., long sequences, high-resolution images, or large videos. To remedy this drawback, this paper proposes a novel efficient attention mechanism, which is equivalent to dot-product attention but has substantially lower memory and computational costs. The resource efficiency allows more widespread and flexible incorporation of efficient attention modules into a neural network, which leads to improved accuracies. Empirical evaluations on object recognition and image classification demonstrated these advantages. Models with efficient attention achieved state-of-the-art performance on MS-COCO 2017 and significant improvements on ImageNet. Further, the resource efficiency of the mechanism makes attention practical for complex models that could not incorporate the original dot-product attention due to its prohibitively high costs. As an exemplar, an efficient attention-augmented model achieved state-of-the-art accuracy for stereo depth estimation on the Scene Flow dataset.
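To make the complexity claim concrete, the sketch below contrasts standard dot-product attention, whose n x n similarity matrix drives the quadratic cost, with a linear-complexity variant that normalizes queries and keys separately and multiplies K^T V first, so the largest intermediate scales with the feature dimension rather than the input size. This is a minimal NumPy illustration of the general reordering idea; the function names and the exact normalization choices here are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # Standard attention: forming the (n, n) similarity matrix makes
    # memory and compute grow quadratically with the number of positions n.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (n, n)
    return A @ V                                           # (n, d_v)

def efficient_attention(Q, K, V):
    # Linear-complexity sketch: normalize Q over features and K over positions,
    # then compute K^T V first; the largest intermediate is (d_k, d_v),
    # independent of n, so cost scales linearly with the input size.
    Qn = softmax(Q, axis=-1)   # softmax over the feature dimension
    Kn = softmax(K, axis=0)    # softmax over the position dimension
    context = Kn.T @ V         # (d_k, d_v)
    return Qn @ context        # (n, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_k, d_v = 1024, 64, 64
    Q = rng.standard_normal((n, d_k))
    K = rng.standard_normal((n, d_k))
    V = rng.standard_normal((n, d_v))
    print(dot_product_attention(Q, K, V).shape)  # (1024, 64)
    print(efficient_attention(Q, K, V).shape)    # (1024, 64)
```

In this sketch the savings come purely from reordering the matrix products: for n positions and feature dimension d, the quadratic path costs O(n^2 d) while the factorized path costs O(n d^2), which is the kind of gap that makes attention affordable for long sequences, high-resolution images, or videos.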