arXiv:2505.00742 (v2, latest)

Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

30 April 2025
Jiaxu Qian, Chendong Wang, Yue Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yifan Yang, Tao Gui, Lili Qiu
Topic: VLM
Links: arXiv (abs) · PDF · HTML
Main: 16 pages · 6 figures · 9 tables · Bibliography: 1 page · Appendix: 1 page
Abstract

Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to adaptively allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.
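
The abstract's budget-aware allocation idea can be made concrete with a short sketch. The code below is a minimal, hypothetical reading of that component, not Zoomer's implementation: it keeps one downsampled global view for spatial context and spends the rest of a fixed token budget on full-resolution crops of prompt-relevant regions. The tile size, the per-tile token cost, the region-box format, and every function name here are illustrative assumptions.

# A minimal, hypothetical sketch of the budget-aware focus idea from the
# abstract; constants, box format, and function names are assumptions,
# not Zoomer's actual API.
from PIL import Image

TILE = 512              # assumed tile edge used by the target MLLM's vision encoder
TOKENS_PER_TILE = 85    # assumed token cost per tile; varies across commercial MLLMs


def tiles_needed(w: int, h: int) -> int:
    """Number of TILE x TILE tiles covering a w x h image (ceiling division)."""
    return -(-w // TILE) * -(-h // TILE)


def budget_aware_views(img: Image.Image, regions, token_budget: int) -> list:
    """Split a token budget between one global view and local detail crops.

    regions: (left, top, right, bottom) boxes, assumed to be ranked by a
    prompt-aware relevance step (grounding the question text in the image).
    """
    # Global context: shrink the whole image into a single tile so coarse
    # object relationships survive at minimal token cost.
    global_view = img.copy()
    global_view.thumbnail((TILE, TILE))
    views = [global_view]
    remaining = token_budget - TOKENS_PER_TILE

    # Local detail: add full-resolution crops, most relevant first, as long
    # as their tile cost fits within the remaining budget.
    for box in regions:
        crop = img.crop(box)
        cost = tiles_needed(*crop.size) * TOKENS_PER_TILE
        if cost <= remaining:
            views.append(crop)
            remaining -= cost
    return views

Ranking regions before packing makes the greedy split degrade gracefully: under a tight budget, only the global view and the most relevant crop survive, rather than uniformly downsampling everything.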

View on arXiv