ResearchTrend.AI
Dynamic Vision Mamba

7 April 2025
Mengxuan Wu
Zekai Li
Zhiyuan Liang
Moyang Li
Xuanlei Zhao
Samir Khaki
Zheng Zhu
Xiaojiang Peng
Konstantinos N. Plataniotis
Kai Wang
Wangbo Zhao
Yang You
Abstract

Mamba-based vision models have gained extensive attention because they are computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, in the form of token and block redundancy. For token redundancy, we analytically find that early token pruning methods result in inconsistency between training and inference, or introduce extra computation at inference time. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically, based on the empirical observation that the inference speed of Mamba-based vision models is largely determined by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a 35.2% FLOP reduction with only a 1.7% accuracy loss on Vim-S. DyVM also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
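The two ideas in the abstract — pruning tokens but re-packing the survivors in their original sequence order before the next Mamba block, and gating SSM blocks per input — can be sketched as follows. This is a minimal illustration, not DyVM's implementation: the scoring, the function names, and the gate mechanism are all placeholders.

```python
def prune_and_rearrange(tokens, scores, keep_ratio):
    """Keep the top-scoring tokens, then restore original sequence order.

    Because Mamba processes tokens as an ordered sequence, the kept
    tokens are sorted back into their original positions before being
    fed to the next block (the "rearranging" step in the abstract).
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Keep the best n_keep indices, then sort them to preserve order.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def dynamic_forward(x, blocks, gates):
    """Apply only the SSM blocks whose gate is open for this input.

    `gates` would come from a learned, per-image selector; here it is
    just a boolean list, since block count dominates inference speed.
    """
    for block, gate_open in zip(blocks, gates):
        if gate_open:
            x = block(x)
    return x
```

For example, with four tokens scored `[0.1, 0.9, 0.5, 0.7]` and a 50% keep ratio, the function keeps the tokens at positions 1 and 3 and returns them in that original order.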

View on arXiv
@article{wu2025_2504.04787,
  title={Dynamic Vision Mamba},
  author={Mengxuan Wu and Zekai Li and Zhiyuan Liang and Moyang Li and Xuanlei Zhao and Samir Khaki and Zheng Zhu and Xiaojiang Peng and Konstantinos N. Plataniotis and Kai Wang and Wangbo Zhao and Yang You},
  journal={arXiv preprint arXiv:2504.04787},
  year={2025}
}