Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Main: 11 pages, 5 figures, 14 tables; bibliography: 2 pages
Abstract
The Vision Transformer (ViT) has gained prominence for its strong relational modeling capabilities. However, the quadratic complexity of its global attention mechanism imposes a substantial computational burden. A common remedy is to spatially group tokens for self-attention, reducing the computational cost. However, this strategy ignores the semantic content of tokens and may scatter semantically related tokens across distinct groups, undermining the self-attention that is meant to model inter-token dependencies. Motivated by these observations, we introduce a fast and balanced clustering method named Semantic Equitable Clustering.
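The idea of grouping tokens by semantic similarity while keeping group sizes equal can be illustrated with a minimal sketch. The code below is an illustrative simplification, not the paper's actual algorithm: it picks random tokens as provisional centers, then greedily assigns each token to its most similar center that still has capacity, so every group ends up with exactly N / num_groups tokens (assuming N is divisible by num_groups).

```python
import numpy as np

def equitable_cluster(tokens, num_groups, seed=0):
    """Partition N token embeddings into equally sized groups by
    cosine similarity to provisional centers. A simplified sketch of
    balanced semantic grouping; assumes N % num_groups == 0."""
    n, d = tokens.shape
    group_size = n // num_groups
    rng = np.random.default_rng(seed)
    # Pick random tokens as provisional cluster centers.
    centers = tokens[rng.choice(n, num_groups, replace=False)]
    # Cosine similarity of every token to every center.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = t @ c.T  # shape (n, num_groups)
    groups = [[] for _ in range(num_groups)]
    assigned = np.zeros(n, dtype=bool)
    # Greedy equitable assignment: walk (token, center) pairs from
    # highest similarity down; a token joins its best group that
    # still has room, so all groups stay the same size.
    order = np.dstack(
        np.unravel_index(np.argsort(-sim, axis=None), sim.shape)
    )[0]
    for i, g in order:
        if not assigned[i] and len(groups[g]) < group_size:
            groups[g].append(int(i))
            assigned[i] = True
    return groups
```

Because every group holds exactly the same number of tokens, subsequent per-group self-attention has uniform cost, which is the practical motivation for keeping the clustering balanced.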
