Vision Transformers (ViTs) excel in semantic segmentation but are computationally expensive, which makes deployment on resource-constrained devices difficult. Existing token pruning methods often overlook fundamental characteristics of visual data. This study introduces 'LVTP', a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features, with two clustering stages. It combines high-level semantics with basic visual attributes for precise segmentation. A dynamic scoring mechanism based on multi-scale Tsallis entropy weighting overcomes the limitations of traditional single-parameter entropy. The framework also analyzes low-level features to preserve critical edge information while reducing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% reductions in computation with negligible performance loss, and the method outperforms existing approaches in balancing cost and accuracy, especially in complex edge regions.
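To make the entropy-based scoring idea concrete, the sketch below computes the standard Tsallis entropy, S_q(p) = (1 - Σ p_i^q) / (q - 1), for a token's class distribution at several values of q and combines them into one score. This is only an illustration under stated assumptions, not the LVTP implementation: the function names, the choice of q values, the uniform weights, and the rule of keeping the highest-entropy tokens are all hypothetical and are not taken from the paper.

```python
import math

def tsallis_entropy(probs, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    Reduces to Shannon entropy (natural log) in the limit q -> 1.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)

def multi_scale_score(probs, qs=(0.5, 1.0, 2.0), weights=None):
    """Weighted sum of Tsallis entropies over several q values.

    The q values and uniform weights are assumptions for illustration.
    """
    if weights is None:
        weights = [1.0 / len(qs)] * len(qs)
    return sum(w * tsallis_entropy(probs, q) for w, q in zip(weights, qs))

def keep_mask(token_probs, keep_ratio=0.7, qs=(0.5, 1.0, 2.0)):
    """Return a boolean mask that keeps the highest-scoring tokens.

    Assumption: uncertain (high-entropy) tokens are kept and confident
    ones are pruned; the paper's actual criterion may differ.
    """
    scores = [multi_scale_score(p, qs) for p in token_probs]
    k = max(1, round(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    kept = set(order[:k])
    return [i in kept for i in range(len(scores))]

# Three tokens with two-class predictions: certain, maximally uncertain, fairly certain.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
mask = keep_mask(tokens, keep_ratio=0.34)
```

Using several q values rather than one lets the score respond both to rare classes (q < 1 emphasizes low-probability entries) and to dominant classes (q > 1 emphasizes high-probability entries), which is one plausible reading of why a single-parameter entropy can be limiting.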
@article{ouyang2025_2504.17996,
  title={Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning},
  author={Yuanbing Ouyang and Yizhuo Liang and Qingpeng Li and Xinfei Guo and Yiming Luo and Di Wu and Hao Wang and Yushan Pan},
  journal={arXiv preprint arXiv:2504.17996},
  year={2025}
}