ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2106.00515
21
105

KVT: k-NN Attention for Boosting Vision Transformers

28 May 2021
Pichao Wang
Xue Wang
F. Wang
Ming Lin
Shuning Chang
Hao Li
R. L. Jin
    ViT
ArXivPDFHTML
Abstract

Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to its ability in capturing locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising performance. A key component in vision transformers is the fully-connected self-attention which is more powerful than CNNs in modelling long range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute attention matrix, it may neglect locality of images patches and involve noisy tokens (e.g., clutter background and occlusion), leading to a slow training process and potential degradation of performance. To address these problems, we propose the kkk-NN attention for boosting vision transformers. Specifically, instead of involving all the tokens for attention matrix calculation, we only select the top-kkk similar tokens from the keys for each query to compute the attention map. The proposed kkk-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than others. In addition, the kkk-NN attention allows for the exploration of long range correlation and at the same time filters out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that kkk-NN attention is powerful in speeding up training and distilling noise from input tokens. Extensive experiments are conducted by using 11 different vision transformer architectures to verify that the proposed kkk-NN attention can work with any existing transformer architectures to improve its prediction performance. The codes are available at \url{https://github.com/damo-cv/KVT}.

View on arXiv
Comments on this paper