Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
Main: 11 pages, 5 figures, 14 tables; bibliography: 2 pages
Abstract
The Vision Transformer (ViT) has gained prominence for its strong relational modeling capabilities. However, the quadratic complexity of its global attention mechanism imposes a substantial computational burden. A common remedy is to spatially group tokens for self-attention, reducing the computational cost. However, this strategy ignores the semantic content of tokens and may scatter semantically related tokens across distinct groups, undermining the self-attention that is meant to model inter-token dependencies. Motivated by these observations, we introduce a fast and balanced clustering method named Semantic Equitable Clustering.
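The idea of grouping tokens by semantic similarity while keeping group sizes equal can be illustrated with a minimal sketch. The code below is an illustrative simplification, not the paper's actual algorithm: it picks random tokens as provisional centers, then greedily assigns each token to its most similar center that still has capacity, so every group ends up with exactly N / num_groups tokens (assuming N is divisible by num_groups).

```python
import numpy as np

def equitable_cluster(tokens, num_groups, seed=0):
    """Partition N token embeddings into equally sized groups by
    cosine similarity to provisional centers. A simplified sketch of
    balanced semantic grouping; assumes N % num_groups == 0."""
    n, d = tokens.shape
    group_size = n // num_groups
    rng = np.random.default_rng(seed)
    # Pick random tokens as provisional cluster centers.
    centers = tokens[rng.choice(n, num_groups, replace=False)]
    # Cosine similarity of every token to every center.
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = t @ c.T  # shape (n, num_groups)
    groups = [[] for _ in range(num_groups)]
    assigned = np.zeros(n, dtype=bool)
    # Greedy equitable assignment: walk (token, center) pairs from
    # highest similarity down; a token joins its best group that
    # still has room, so all groups stay the same size.
    order = np.dstack(
        np.unravel_index(np.argsort(-sim, axis=None), sim.shape)
    )[0]
    for i, g in order:
        if not assigned[i] and len(groups[g]) < group_size:
            groups[g].append(int(i))
            assigned[i] = True
    return groups
```

Because every group holds exactly the same number of tokens, subsequent per-group self-attention has uniform cost, which is the practical motivation for keeping the clustering balanced.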
