2412.04467
Cited By
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
5 December 2024
Senqiao Yang
Yukang Chen
Zhuotao Tian
Chengyao Wang
Jingyao Li
Bei Yu
Jiaya Jia
VLM
ArXiv (abs)
PDF
HTML
HuggingFace (117 upvotes)
Github (284★)
Papers citing
"VisionZip: Longer is Better but Not Necessary in Vision Language Models"
50 / 54 papers shown
Jina-VLM: Small Multilingual Vision Language Model
Andreas Koukounas
Georgios Mastrapas
Florian Hönicke
Sedigheh Eslami
Guillaume Roncari
Scott Martens
Han Xiao
MLLM
336
0
0
03 Dec 2025
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin
Y. Liu
Yang Yang
Lvfang Tao
Deheng Ye
VLM
98
0
0
03 Dec 2025
Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
Zhongyu Yang
Dannong Xu
Wei Pang
Yingfang Yuan
VLM
185
0
0
01 Dec 2025
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya
Bicheng Xu
Sanjay Haresh
Reza Pourreza
Litian Liu
Sunny Panchal
Pulkit Madan
Leonid Sigal
Roland Memisevic
112
0
0
27 Nov 2025
Object-Centric Vision Token Pruning for Vision Language Models
Guangyuan Li
R. Zhao
Jinhong Deng
Yanbo Wang
Joni Pajarinen
VLM
173
0
0
25 Nov 2025
Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
Bianka Kowalska
Halina Kwaśnicka
179
0
0
24 Nov 2025
FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning
Guoyang Xia
Yifeng Ding
Fengfa Li
Lei Ren
Wei Chen
Fangxiang Feng
Xiaojie Wang
MoE
VLM
187
0
0
22 Nov 2025
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Boshen Xu
Zihan Xiao
Jiaze Li
Jianzhong Ju
Zhenbo Luo
Jian Luan
Qin Jin
Mamba
533
0
0
20 Nov 2025
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Duo Li
Zuhao Yang
Xiaoqin Zhang
Ling Shao
Shijian Lu
VLM
154
1
0
19 Nov 2025
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao
Kele Shao
Bohan Yu
Weiqiang Wang
Jian Liu
Huan Wang
VLM
253
2
0
18 Nov 2025
RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning
Jingqi Xu
Jingxi Lu
Chenghao Li
Sreetama Sarkar
Souvik Kundu
Peter A. Beerel
VLM
176
0
0
16 Nov 2025
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
Xin Jin
Siyuan Li
Siyong Jian
Kai Yu
Huan Wang
141
0
0
27 Oct 2025
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Xuyang Liu
Xiyan Gui
Y. Zhang
Linfeng Zhang
VLM
134
2
0
23 Oct 2025
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen
Keda Tao
Kele Shao
Huan Wang
194
2
0
21 Oct 2025
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu
Yurui Zhu
Xin Lu
Wenrui Yan
Dong Li
Kunlin Liu
Xueyang Fu
Zheng-Jun Zha
MQ
VLM
249
0
0
18 Oct 2025
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Peiran Wu
Zhuorui Yu
Yunze Liu
Chi-Hao Wu
Enmin Zhou
Junxiao Shen
OffRL
VLM
95
1
0
09 Oct 2025
Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou
Di Lu
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Xu Zheng
Linfeng Zhang
Xuming Hu
VLM
281
6
0
03 Oct 2025
VideoNSA: Native Sparse Attention Scales Video Understanding
Enxin Song
Wenhao Chai
Shusheng Yang
Ethan Armand
Xiaojun Shan
Haiyang Xu
Jianwen Xie
Zhuowen Tu
136
3
0
02 Oct 2025
HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score
Jingqi Xu
Jingxi Lu
Chenghao Li
Sreetama Sarkar
Peter A. Beerel
156
1
0
28 Sep 2025
CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Xinyu Zhang
Yuxuan Dong
L. Zhang
Chengyou Jia
Zhuohang Dang
Basura Fernando
Jun Liu
Mike Zheng Shou
LRM
280
1
0
26 Sep 2025
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri
Connor Ding
Tsachy Weissman
Thierry Tambe
3DGS
VLM
288
0
0
26 Sep 2025
Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models
Jinyeong Kim
Seil Kang
Jiwoo Park
Junhyeok Kim
Seong Jae Hwang
134
1
0
22 Sep 2025
Visual Representation Alignment for Multimodal Large Language Models
Heeji Yoon
Jaewoo Jung
J. Kim
Hyungyu Choi
Heeseong Shin
...
Jisang Han
Donghyun Kim
Chanho Eom
Sunghwan Hong
Seungryong Kim
125
10
0
09 Sep 2025
Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning
Zhang Jing
Pu Nan
Xie Yu Xiang
Guo Yanming
Lu Qianqi
Zou Shiwei
Yan Jie
Chen Yan
CLL
128
1
0
08 Sep 2025
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
Kang Zeng
Guojin Zhong
Jintao Cheng
Jin Yuan
Zhiyong Li
135
0
0
25 Aug 2025
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Sixun Dong
Juhua Hu
Mian Zhang
Ming Yin
Yanjie Fu
Qi Qian
111
4
0
25 Aug 2025
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang
Runsen Xu
Chenhang Cui
Tai Wang
Dahua Lin
Jiangmiao Pang
135
2
0
07 Aug 2025
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Ao Li
Yuxiang Duan
Jinghui Zhang
Congbo Ma
Yutong Xie
G. Carneiro
Mohammad Yaqub
Hu Wang
140
0
0
28 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
508
11
0
27 Jul 2025
Mitigating Object Hallucinations via Sentence-Level Early Intervention
Shangpin Peng
Senqiao Yang
Li Jiang
Zhuotao Tian
MLLM
243
5
0
16 Jul 2025
Loss-Oriented Ranking for Automated Visual Prompting in LVLMs
Yuan Zhang
Chun-Kai Fan
Tao Huang
Ming Lu
Sicheng Yu
Junwen Pan
Kuan Cheng
Qi She
Shanghang Zhang
VLM
LRM
246
2
0
19 Jun 2025
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang
Mengzhen Liu
Lichen Li
Ming Lu
Yuan Zhang
Junwen Pan
Qi She
Shanghang Zhang
VLM
395
18
0
12 Jun 2025
Dual-Priv Pruning: Efficient Differential Private Fine-Tuning in Multimodal Large Language Models
Qianshan Wei
Jiaqi Li
Zihan You
Yi Zhan
Kecen Li
...
Yi Yu
Bin Cao
Yiwen Xu
Wenshu Fan
Guilin Qi
AAML
VLM
160
1
0
08 Jun 2025
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei
Jie Gu
Xiaokang Ma
Chu Tang
Jingmin Chen
Tong Xu
241
1
0
01 Jun 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
Jieyu Zhang
Mohammadreza Salehi
Ziqi Gao
Vishnu Iyengar
Norimasa Kobori
Quan Kong
Ranjay Krishna
375
2
0
29 May 2025
PixelThink: Towards Efficient Chain-of-Pixel Reasoning
Song Wang
Gongfan Fang
Lingdong Kong
Xiangtai Li
Jianyun Xu
Maochun Luo
Qiang Li
Jianke Zhu
Xinchao Wang
LRM
340
12
0
29 May 2025
Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
Mingchen Shao
Xinfa Zhu
C. Wang
Bingshen Mu
Hai Li
Ying Yan
Junhui Liu
Danming Xie
Lei Xie
178
2
0
28 May 2025
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang
Kaixin Ma
Tianqing Fang
Wenhao Yu
Hongming Zhang
Zhisong Zhang
Yaqi Xie
Katia Sycara
Haitao Mi
Dong Yu
VLM
304
7
0
28 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
613
15
0
27 May 2025
Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Penghao Wu
Lewei Lu
Ziwei Liu
282
1
0
21 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li
Zicheng Zhang
Songhua Liu
Weihao Yu
Xinchao Wang
VLM
334
2
0
17 May 2025
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Run Luo
Renke Shan
Longze Chen
Ziqiang Liu
Lu Wang
Min Yang
Xiaobo Xia
MLLM
VLM
508
3
0
28 Apr 2025
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li
Huiqiang Jiang
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Jianfeng Gao
Yue Yang
Lili Qiu
350
17
0
22 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
397
20
0
20 Apr 2025
Beyond Intermediate States: Explaining Visual Redundancy through Language
Dingchen Yang
Bowen Cao
Anran Zhang
Weibo Gu
Winston Hu
Guang Chen
VLM
251
2
0
26 Mar 2025
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Manwen Liao
VLM
621
4
0
26 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Computer Vision and Pattern Recognition (CVPR), 2025
Baifeng Shi
Boyi Li
Han Cai
Yaojie Lu
Sifei Liu
...
Jan Kautz
Enze Xie
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
901
12
0
25 Mar 2025
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao
Mingyang Wang
Zhou Yu
Wenwen Pan
Yan Yang
Tao Wei
Hao Zhang
Ning Mao
Wei Chen
Jun Yu
VLM
353
6
0
18 Mar 2025
Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference
Cheng Yuan
Ziqiang Liu
Jiashu Lv
Jiawei Shao
Yufei Jiang
Jing Zhang
Xuelong Li
344
6
0
17 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
975
11
0
16 Mar 2025