Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

26 March 2025
Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan
    VLM
Abstract

Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
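The abstract only sketches the mechanism, so here is a minimal, hypothetical PyTorch illustration of the Skip-FFN idea: a pre-norm transformer block in which attention is applied to every token, but the FFN sub-layer is computed only for tokens not flagged as redundant. The class name `SkipFFNBlock`, the `skip_mask` argument, the layer sizes, and the gating convention are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SkipFFNBlock(nn.Module):
    """Illustrative pre-norm block that bypasses the FFN for masked visual tokens."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ffn: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, skip_mask: torch.Tensor) -> torch.Tensor:
        # x:         (batch, seq, d_model)
        # skip_mask: (batch, seq) bool; True where the FFN should be bypassed
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out  # attention still covers all tokens, including skipped ones

        keep = ~skip_mask                        # tokens that still pass through the FFN
        ffn_in = self.norm2(x)
        ffn_out = torch.zeros_like(x)
        ffn_out[keep] = self.ffn(ffn_in[keep])   # FFN computed only for kept tokens
        return x + ffn_out                       # skipped tokens carry the residual unchanged


# Usage sketch: mark some visual-token positions as redundant and run the block.
block = SkipFFNBlock()
tokens = torch.randn(2, 16, 768)
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, 8:] = True                               # pretend the last 8 tokens are redundant
out = block(tokens, mask)
```

The inference-side counterpart described in the abstract would act analogously on the KV cache: the cached key-value pairs belonging to skipped visual tokens are dropped during decoding, shrinking both memory and attention cost for subsequent steps. The exact selection criterion is not specified in the abstract.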

@article{zeng2025_2503.21817,
  title={Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping},
  author={Weili Zeng and Ziyuan Huang and Kaixiang Ji and Yichao Yan},
  journal={arXiv preprint arXiv:2503.21817},
  year={2025}
}