MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

6 February 2024
Xiangxiang Chu
Limeng Qiao
Xinyu Zhang
Shuang Xu
Fei Wei
Yang Yang
Xiaofei Sun
Yiming Hu
Xinyang Lin
Bo Zhang
Chunhua Shen
VLM, MLLM
ArXiv (abs) | PDF | HTML | HuggingFace (15 upvotes) | GitHub (1219★)

Papers citing "MobileVLM V2: Faster and Stronger Baseline for Vision Language Model"

50 / 91 papers shown
CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Hasan Akgul
Mari Eplik
Javier Rojas
Aina Binti Abdullah
Pieter van der Merwe
43
0
0
22 Oct 2025
The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators
Mansi Sakarvadia
Kareem Hegazy
A. Totounferoush
Kyle Chard
Yaoqing Yang
Ian Foster
Michael W. Mahoney
SupR
132
1
0
08 Oct 2025
GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness
Kung-Hsiang Huang
Haoyi Qiu
Yutong Dai
Caiming Xiong
Chien-Sheng Wu
20
0
0
01 Oct 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
Zichen Wen
Shaobo Wang
Yufa Zhou
J. Zhang
Qintong Zhang
...
Zhaorun Chen
Bin Wang
W. Li
Conghui He
Linfeng Zhang
VLM
32
3
0
01 Oct 2025
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Zhengyan Wan
Yidong Ouyang
Liyan Xie
Fang Fang
Hongyuan Zha
Guang Cheng
44
0
0
26 Sep 2025
Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance
Fengze Yang
Bo Yu
Yang Zhou
Xuewen Luo
Zhengzhong Tu
Chenxi Liu
102
1
0
01 Aug 2025
AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
Umair Nawaz
Muhammad Zaigham Zaheer
Fahad Shahbaz Khan
Hisham Cholakkal
Salman Khan
Rao Muhammad Anwer
64
1
0
29 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
237
5
0
27 Jul 2025
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM, VLM
106
1
0
20 Jun 2025
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
Yicheng Xiao
Lin Song
Rui Yang
Cheng Cheng
Zunnan Xu
Zhaoyang Zhang
Yixiao Ge
Xiu Li
Mingyu Ding
164
4
0
03 Jun 2025
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
Linglin Jing
Yuting Gao
Zhigang Wang
Wang Lan
Yiwen Tang
Wenhai Wang
Kaipeng Zhang
Qingpei Guo
MoE
103
1
0
28 May 2025
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Matthew Lisondra
B. Benhabib
G. Nejat
LM&Ro
151
1
0
26 May 2025
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
Jin Wang
Yao Lai
Aoxue Li
Shifeng Zhang
Jiacheng Sun
Ning Kang
Chengyue Wu
Zhenguo Li
Ping Luo
201
13
0
26 May 2025
Robustifying Vision-Language Models via Dynamic Token Reweighting
Tanqiu Jiang
Jiacheng Liang
Rongyi Zhu
Jiawei Zhou
Fenglong Ma
Ting Wang
AAML
199
1
0
22 May 2025
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Seungeun Oh
Jinhyuk Kim
Jihong Park
Seung-Woo Ko
Jinho Choi
Tony Q. S. Quek
Seong-Lyun Kim
159
0
0
17 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li
Zicheng Zhang
Songhua Liu
Weihao Yu
Xinchao Wang
VLM
258
1
0
17 May 2025
Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
Bo Zhang
Shuo Li
Runhe Tian
Yang Yang
Jixin Tang
Jinhao Zhou
Lin Ma
VLM
169
5
0
14 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
162
2
0
11 May 2025
Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends
M. Tami
Mohammed Elhenawy
Huthaifa I. Ashqar
184
1
0
21 Apr 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
730
0
0
27 Mar 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu
Feiyu Xiong
Lumin Xu
Sheng Jin
Zhonghua Wu
Qingyi Tao
Wentao Liu
Wei Li
Chen Change Loy
VGen
788
24
0
27 Mar 2025
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu
Yuyao Sun
Zilu Zhang
Leping Huang
Jianliang Zeng
Mao Shu
Huo Cao
260
6
0
27 Mar 2025
Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Hao Ai
Kunyi Wang
Zezhou Wang
H. Lu
Jin Tian
Yaxin Luo
Peng-Fei Xing
Jen-Yuan Huang
Huaxia Li
Gen Luo
MLLM, VLM
245
1
0
26 Mar 2025
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Jialv Zou
Bencheng Liao
Qian Zhang
Wenyu Liu
Xinggang Wang
Mamba, MLLM
228
2
0
11 Mar 2025
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Xudong Lu
Yinghao Chen
Renshou Wu
Haohao Gao
Xi Chen
...
Fangyuan Li
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
241
0
0
08 Mar 2025
Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models
Yi Wang
Mushui Liu
Wanggui He
Longxiang Zhang
...
Weilong Dai
Mingli Song
Hao Jiang
Jie Song
MLLM, MoE, LRM
218
12
0
03 Mar 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
254
10
0
11 Feb 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
International Conference on Learning Representations (ICLR), 2025
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM, VLM
262
82
0
07 Jan 2025
Efficient Architectures for High Resolution Vision-Language Models
International Conference on Computational Linguistics (COLING), 2025
Miguel Carvalho
Bruno Martins
MLLM, VLM
103
1
0
05 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM, VLM, LRM
527
99
0
03 Jan 2025
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Information Fusion (Inf. Fusion), 2024
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw, LM&MA, LRM
259
58
0
31 Dec 2024
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
197
0
0
12 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2024
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Juil Sock
VLM, ObjD
911
2
0
12 Dec 2024
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
Qianhan Feng
Wenshuo Li
Tong Lin
Xinghao Chen
VLM
186
6
0
02 Dec 2024
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Byung-Kwan Lee
Ryo Hachiuma
Yu-Chiang Frank Wang
Y. Ro
Yueh-Hua Wu
VLM
231
3
0
02 Dec 2024
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
Te Yang
Jian Jia
Xiangyu Zhu
Weisong Zhao
Bo Wang
...
Shengyuan Liu
Quan Chen
Peng Jiang
Kun Gai
Zhen Lei
110
3
0
23 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
202
4
0
23 Nov 2024
MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective
Hailang Huang
Yong Wang
Zixuan Huang
Huaqiu Li
Tongwen Huang
Xiangxiang Chu
Richong Zhang
MLLM, LM&MA, EGVM
217
2
0
21 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Hao Sun
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
120
7
0
14 Nov 2024
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
Peng Tang
Jiacheng Liu
X. Hou
Yifei Pu
Jing Wang
Pheng-Ann Heng
Chong Li
Minyi Guo
MoE
196
22
0
03 Nov 2024
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
International Conference on Learning Representations (ICLR), 2024
Leander Girrbach
Yiran Huang
Stephan Alaniz
Trevor Darrell
Zeynep Akata
VLM
259
6
0
25 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
285
74
0
21 Oct 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
210
205
0
17 Oct 2024
Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation
Shun Qian
Bingquan Liu
Chengjie Sun
Zhen Xu
Baoxun Wang
80
0
0
14 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM, MLLM
229
60
0
10 Oct 2024
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Phu Pham
Kun Wan
Yu-Jhe Li
Zeliang Zhang
Daniel Miranda
Ajinkya Kale
Chenliang Xu
150
14
0
08 Oct 2024
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
International Conference on Learning Representations (ICLR), 2024
Yifei Xing
Xiangyuan Lan
Ruiping Wang
Shihong Deng
Wenjun Huang
Qingfang Zheng
Yaowei Wang
Mamba
175
2
0
08 Oct 2024
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
Huilin Deng
Hongchen Luo
Wei Zhai
Yang Cao
Yu Kang
134
5
0
30 Sep 2024
MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios
AAAI Conference on Artificial Intelligence (AAAI), 2024
Jiacheng Ruan
Wenzhen Yuan
Zehao Lin
Ning Liao
Zhiyu Li
Feiyu Xiong
Ting Liu
Yuzhuo Fu
150
7
0
24 Sep 2024
Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee
Sangyun Chung
Chae Won Kim
Beomchan Park
Yong Man Ro
VLM, LRM
157
10
0
23 Sep 2024