FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

5 January 2025
Haicheng Wang
Zhemeng Yu
Gabriele Spadaro
Chen Ju
Victor Quétu
Enzo Tartaglione
    VLM
ArXiv (abs) · PDF · HTML · GitHub (15★)

Papers citing "FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance"

50 / 80 papers shown
Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Zhicheng Wang
Chen Ju
X. Chen
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Ying Chen
Zhiguo Cao
268
2
0
03 Nov 2025
Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation
Yizhu Chen
Chen Ju
Z. Wang
Shuai Xiao
X. Chen
Jinsong Lan
Xiaoyong Zhu
Ying Chen
160
0
0
03 Nov 2025
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu
Yurui Zhu
Xin Lu
Wenrui Yan
Dong Li
Kunlin Liu
Xueyang Fu
Zheng-Jun Zha
MQ
VLM
296
5
0
18 Oct 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Zichen Wen
Shaobo Wang
Yufa Zhou
J. Zhang
Qintong Zhang
...
Zhaorun Chen
Bin Wang
W. Li
Conghui He
Linfeng Zhang
VLM
227
15
0
01 Oct 2025
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu
Yaoming Wang
Bowen Shi
Xiaopeng Zhang
Wenrui Dai
Chenglin Li
Hongkai Xiong
Qi Tian
252
2
0
28 Jul 2025
A Survey of Token Compression for Efficient Multimodal Large Language Models
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
721
12
0
27 Jul 2025
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei
Jie Gu
Xiaokang Ma
Chu Tang
Jingmin Chen
Tong Xu
305
1
0
01 Jun 2025
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang
Kaixin Ma
Tianqing Fang
Wenhao Yu
Hongming Zhang
Zhisong Zhang
Yaqi Xie
Katia Sycara
VLM
403
23
0
28 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
774
36
0
27 May 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
543
31
0
20 Apr 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
...
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
331
3
0
18 Mar 2025
TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models
Xudong Tan
Peng Ye
Chongjun Tu
Jianjian Cao
Yaoxin Yang
Lin Zhang
Dongzhan Zhou
Tao Chen
VLM
521
12
0
13 Mar 2025
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Senqiao Yang
Yukang Chen
Zhuotao Tian
Chengyao Wang
Jingyao Li
Bei Yu
Jiaya Jia
VLM
338
145
0
05 Dec 2024
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Haicheng Wang
Chen Ju
Weixiong Lin
Shuai Xiao
Mengting Chen
...
Mingshuai Yao
Jinsong Lan
Ying Chen
Qingwen Liu
Yanfeng Wang
VLM
CLIP
407
14
0
30 Nov 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
686
276
0
06 Oct 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
452
128
0
28 Aug 2024
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Mingchen Zhuge
Jian Ding
Deyao Zhu
Jürgen Schmidhuber
Mohamed Elhoseiny
VLM
290
42
0
17 Jul 2024
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju
Haicheng Wang
Haozhe Cheng
Xu Chen
Zhonghua Zhai
Weilin Huang
Jinsong Lan
Shuai Xiao
Bo Zheng
VLM
246
17
0
16 Jul 2024
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Xinyu Fang
Kangrui Mao
Haodong Duan
Xiangyu Zhao
Yining Li
Dahua Lin
Kai Chen
VLM
265
175
0
20 Jun 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
889
1,086
0
31 May 2024
DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
Haozhe Cheng
Chen Ju
Haicheng Wang
Jinxiang Liu
Mengting Chen
Qiang Hu
Xiaoyun Zhang
Yanfeng Wang
DiffM
VLM
266
8
0
23 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
375
67
0
10 Apr 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Yuhang Zang
...
Haodong Duan
Yuan Liu
Yu Qiao
Dahua Lin
Feng Zhao
VLM
513
706
0
29 Mar 2024
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang
Mu Cai
Bingxin Xu
Yong Jae Lee
Yan Yan
VLM
715
285
0
22 Mar 2024
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
Mengting Chen
Xi Chen
Zhonghua Zhai
Chen Ju
Xuewen Hong
Jinsong Lan
Shuai Xiao
OOD
DiffM
306
44
0
19 Mar 2024
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu
Yikun Liu
Fei Zhang
Chen Ju
Ya Zhang
Yanfeng Wang
389
29
0
17 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM
VLM
433
468
0
11 Mar 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Computer Vision and Pattern Recognition (CVPR), 2024
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
518
668
0
11 Jan 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
792
2,644
0
21 Dec 2023
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
Chen Ju
Haicheng Wang
Zeqian Li
Xu Chen
Zhonghua Zhai
Weilin Huang
Shuai Xiao
VLM
366
11
0
12 Dec 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Computer Vision and Pattern Recognition (CVPR), 2023
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
969
1,945
0
27 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
1.8K
1,402
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
638
401
0
14 Nov 2023
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Tianrui Guan
Fuxiao Liu
Xiyang Wu
Ruiqi Xian
Zongxia Li
...
Lichang Chen
Furong Huang
Yaser Yacoob
Dinesh Manocha
Wanrong Zhu
VLM
MLLM
630
471
0
23 Oct 2023
Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression
International Conference on Learning Representations (ICLR), 2023
Adam Block
Dylan J. Foster
Akshay Krishnamurthy
Max Simchowitz
Cyril Zhang
319
11
0
17 Oct 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
1.7K
677
0
14 Oct 2023
Improved Baselines with Visual Instruction Tuning
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
746
4,820
0
05 Oct 2023
Ring Attention with Blockwise Transformers for Near-Infinite Context
International Conference on Learning Representations (ICLR), 2023
Hao Liu
Matei A. Zaharia
Pieter Abbeel
803
446
0
03 Oct 2023
AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation
Neural Information Processing Systems (NeurIPS), 2023
Chaofan Ma
Yu-Hao Yang
Chen Ju
Fei Zhang
Ya Zhang
Yanfeng Wang
VLM
517
34
0
31 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
ACM Multimedia (ACM MM), 2023
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
213
22
0
02 Aug 2023
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li
Rui Wang
Guangzhi Wang
Yuying Ge
Yixiao Ge
Ying Shan
MLLM
ELM
635
875
0
30 Jul 2023
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Jinxiang Liu
Chen Ju
Chaofan Ma
Yanfeng Wang
Yu Wang
Ya Zhang
VOS
433
39
0
25 Jul 2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
International Conference on Learning Representations (ICLR), 2023
Tri Dao
LRM
627
2,480
0
17 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
European Conference on Computer Vision (ECCV), 2023
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Yuan Liu
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
995
1,956
0
12 Jul 2023
Multi-Modal Prototypes for Open-World Semantic Segmentation
International Journal of Computer Vision (IJCV), 2023
Yu-Hao Yang
Chaofan Ma
Chen Ju
Fei Zhang
Jiangchao Yao
Ya Zhang
Yanfeng Wang
VLM
438
20
0
05 Jul 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Chunjiang Ge
Yulei Qin
Mengdan Zhang
...
Xing Sun
Zhenyu Qiu
Rongrong Ji
Caifeng Shan
Ran He
ELM
MLLM
1.1K
1,448
0
23 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Computer Vision and Pattern Recognition (CVPR), 2023
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM
MLLM
340
26
0
06 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
798
1,669
0
05 Jun 2023
Annotation-free Audio-Visual Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Jinxiang Liu
Yu Wang
Chen Ju
Chaofan Ma
Ya Zhang
Weidi Xie
VOS
VLM
456
54
0
18 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM
LRM
855
1,526
0
17 May 2023