
Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens

Main: 11 pages · 5 figures · Bibliography: 2 pages · 14 tables
Abstract

The Vision Transformer (ViT) has gained prominence for its strong relational modeling ability. However, the quadratic complexity of its global attention mechanism imposes a substantial computational burden. A common remedy is to group tokens spatially for self-attention, reducing the computational cost. This strategy, however, neglects the semantic information carried by tokens and may scatter semantically related tokens across different groups, undermining the very purpose of self-attention: modeling inter-token dependencies. Motivated by these observations, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
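To make the idea of balanced, semantics-aware grouping concrete, here is a minimal sketch of one way such a clustering could work. This is an illustrative assumption, not the paper's actual algorithm: tokens are ranked by cosine similarity to their mean embedding (a stand-in for a global semantic reference) and then split into contiguous, equal-sized groups, so every group receives the same number of tokens for self-attention.

```python
import numpy as np

def equitable_semantic_groups(tokens: np.ndarray, num_groups: int) -> np.ndarray:
    """Hypothetical sketch of equal-sized semantic grouping.

    tokens: (N, D) array of token embeddings, with N divisible by num_groups.
    Returns an (num_groups, N // num_groups) array of token indices:
    tokens are sorted by cosine similarity to the mean token and then
    partitioned into contiguous, equal-sized groups.
    """
    n, _ = tokens.shape
    assert n % num_groups == 0, "equal-sized groups require N divisible by num_groups"
    center = tokens.mean(axis=0)
    # Cosine similarity of each token to the global mean token.
    sims = tokens @ center / (
        np.linalg.norm(tokens, axis=1) * np.linalg.norm(center) + 1e-8
    )
    order = np.argsort(-sims)  # most semantically central tokens first
    return order.reshape(num_groups, n // num_groups)

# Example: split 8 random 16-dim tokens into 2 equal groups of 4 indices.
rng = np.random.default_rng(0)
groups = equitable_semantic_groups(rng.normal(size=(8, 16)), 2)
```

Unlike spatial windowing, this ranking keeps semantically similar tokens together, and the fixed group size keeps the per-group attention cost uniform, which is what makes the clustering "equitable."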
