Efficient Online Inference of Vision Transformers by Training-Free Tokenization

23 November 2024

Leonidas Gee

ArXiv (abs)PDF HTML Github (2★)

Main:11 Pages

9 Figures

Bibliography:4 Pages

13 Tables

Appendix:6 Pages

Abstract

The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to runtime, making them ill-suited for online (real-time) inference, where a prediction is made on any new input as it comes in. We introduce the $\textbf{Visual Word Tokenizer}$ (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups visual subwords (image patches) that are frequently used into visual words while infrequent ones remain intact. To do so, $\textit{intra}$ -image or $\textit{inter}$ -image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in wattage of up to 25% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 100% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.

View on arXiv

Comments on this paper