ResearchTrend.AI
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

11 November 2023
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
    MLLM

Papers citing "Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models"

50 / 200 papers shown
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang
Y. Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
...
Yuan Yao
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VLM
81
0
0
18 Dec 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
105
6
0
27 Nov 2024
DOGE: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou
Yuxin Chen
Haokun Lin
Shuyu Yang
Li Zhu
Zhongang Qi
Chen Ma
Ying Shan
ObjD
76
2
0
26 Nov 2024
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Shezheng Song
Chengxiang He
Shasha Li
Shan Zhao
Chengyu Wang
...
Xiaopeng Li
Qian Wan
Jun Ma
Jie Yu
Xiaoguang Mao
VLM
82
1
0
25 Nov 2024
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu
Chi Xie
Shuang Liang
Bo Zheng
Sheng Guo
64
8
0
21 Nov 2024
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
Honglin Li
Yuting Gao
Chenglu Zhu
Jingdong Chen
M. Yang
Lin Yang
MLLM
79
0
0
21 Nov 2024
SignEye: Traffic Sign Interpretation from Vehicle First-Person View
Chuang Yang
Xu Han
T. Han
Yuejiao Su
Junyu Gao
Hongyuan Zhang
Yi Wang
Lap-Pui Chau
77
2
0
18 Nov 2024
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
Jaemin Cho
Debanjan Mahata
Ozan Irsoy
Yujie He
Mohit Bansal
VLM
20
8
0
07 Nov 2024
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures
Tianxiang Wu
Minxin Nie
Ziqiang Cao
MLLM
40
0
0
30 Oct 2024
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Fengbin Zhu
Ziyang Liu
Xiang Yao Ng
Haohui Wu
W. Wang
Fuli Feng
Chao Wang
Huanbo Luan
Tat-Seng Chua
VLM
35
3
0
25 Oct 2024
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen
Hangcheng Li
Jiaqing Liang
Sihang Jiang
Deqing Yang
LLMAG
46
2
0
25 Oct 2024
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
Linger Deng
Yuliang Liu
Bohan Li
Dongliang Luo
Liang Wu
...
Ziyang Zhang
Gang Zhang
Errui Ding
Yingying Zhu
Xiang Bai
ReLM
3DV
LRM
26
10
0
23 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
43
1
0
21 Oct 2024
LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Xuechen Guo
Wenhao Chai
Shi-Yan Li
Gaoang Wang
31
5
0
19 Oct 2024
MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations
Liang Xu
Shaoyang Hua
Zili Lin
Yifan Liu
Feipeng Ma
Yichao Yan
Xin Jin
Xiaokang Yang
Wenjun Zeng
VGen
39
3
0
17 Oct 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao
Yangzhou Liu
Zhe Chen
Guangchen Shi
Wenhai Wang
Danhuai Zhao
Tong Lu
41
5
0
15 Oct 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
23
5
0
15 Oct 2024
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Chenxi Wang
Xiang Chen
N. Zhang
Bozhong Tian
Haoming Xu
Shumin Deng
H. Chen
MLLM
LRM
29
4
0
15 Oct 2024
Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation
Shun Qian
Bingquan Liu
Chengjie Sun
Zhen Xu
Baoxun Wang
26
0
0
14 Oct 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua
Yunlong Tang
Ziyun Zeng
Liangliang Cao
Zhengyuan Yang
Hangfeng He
Chenliang Xu
Jiebo Luo
VLM
CoGe
31
9
0
13 Oct 2024
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
Yue Yang
S. Zhang
Wenqi Shao
Kaipeng Zhang
Yi Bin
Yu Wang
Ping Luo
28
3
0
11 Oct 2024
R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?
Chunyi Li
J. Zhang
Zicheng Zhang
H. Wu
Yuan Tian
...
Guo Lu
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
AAML
39
3
0
07 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
P. Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
52
1
0
02 Oct 2024
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Mengzhao Jia
Wenhao Yu
Kaixin Ma
Tianqing Fang
Zhihan Zhang
Siru Ouyang
Hongming Zhang
Meng-Long Jiang
Dong Yu
VLM
29
5
0
02 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
36
32
1
30 Sep 2024
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Jiacong Wang
Bohong Wu
Haiyong Jiang
Xun Zhou
Xin Xiao
Haoyuan Guo
Jun Xiao
VLM
VGen
36
4
0
30 Sep 2024
Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee
Sangyun Chung
Chae Won Kim
Beomchan Park
Yong Man Ro
VLM
LRM
39
6
0
23 Sep 2024
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan
Liqiang Niu
Fandong Meng
Wenbo Li
Jie Zhou
Jinsong Su
VLM
30
2
0
20 Sep 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
74
54
0
19 Sep 2024
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye
Qiong Wu
Wenhao Lin
Yiyi Zhou
VLM
27
10
0
16 Sep 2024
Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding
Xiaoyu Liang
Jiayuan Yu
Lianrui Mu
Jiedong Zhuang
Jiaqi Hu
Yuchen Yang
Jiangnan Ye
Lu Lu
Jian Chen
Haoji Hu
VLM
35
2
0
10 Sep 2024
READoc: A Unified Benchmark for Realistic Document Structured Extraction
Zichao Li
Aizier Abulaiti
Y. Lu
Xuanang Chen
Jia Zheng
Hongyu Lin
Xianpei Han
Le Sun
27
3
0
08 Sep 2024
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Anwen Hu
Haiyang Xu
Liang Zhang
Jiabo Ye
Ming Yan
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
22
27
0
05 Sep 2024
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang
Wengang Zhou
Hao Feng
Houqiang Li
VLM
22
0
0
30 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
23
53
0
28 Aug 2024
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Peng Wang
Zhaohai Li
Jun Tang
Humen Zhong
Fei Huang
Zhibo Yang
Cong Yao
VLM
ObjD
31
2
0
27 Aug 2024
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Wenhui Liao
Jiapeng Wang
Hongliang Li
Chengyu Wang
Jun Huang
Lianwen Jin
35
0
0
27 Aug 2024
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge
Yang Zheng
Kaitai Guo
Jimin Liang
27
1
0
27 Aug 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin
Yifan Zhu
Xin Mei
Ling Huang
Jingying Ma
Kai He
Zhen Peng
Erik Cambria
Mengling Feng
32
16
0
23 Aug 2024
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
...
Qingsong Wen
Zhang Zhang
L. Wang
Rong Jin
Tieniu Tan
OffRL
52
36
0
23 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
34
60
0
22 Aug 2024
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
Feipeng Ma
Yizhou Zhou
Hebei Li
Zilong He
Siying Wu
Fengyun Rao
Yueyi Zhang
Xiaoyan Sun
29
3
0
21 Aug 2024
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Kazi Hasan Ibn Arif
JinYi Yoon
Dimitrios S. Nikolopoulos
Hans Vandierendonck
Deepu John
Bo Ji
MLLM
VLM
30
14
0
20 Aug 2024
Visual Agents as Fast and Slow Thinkers
Guangyan Sun
Mingyu Jin
Zhenting Wang
Cheng-Long Wang
Siqi Ma
Qifan Wang
Ying Nian Wu
Dongfang Liu
LLMAG
LRM
74
12
0
16 Aug 2024
A Training-Free Framework for Video License Plate Tracking and Recognition with Only One-Shot
Haoxuan Ding
Qi Wang
Junyu Gao
Qiang Li
VLM
37
0
0
11 Aug 2024
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng
J. Wang
Chuanhao Li
Quanfeng Lu
Hao Tian
...
Jifeng Dai
Yu Qiao
Ping Luo
Kaipeng Zhang
Wenqi Shao
VLM
50
17
0
05 Aug 2024
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang
Yuliang Liu
Dingkang Liang
Lianwen Jin
Xiang Bai
37
9
0
04 Aug 2024
Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval
Gangyan Zeng
Yuan Zhang
Jin Wei
Dongbao Yang
Peng Zhang
Yiwen Gao
Xugong Qin
Yu Zhou
VLM
CLIP
13
0
0
01 Aug 2024
WAS: Dataset and Methods for Artistic Text Segmentation
Xudong Xie
Yuzhe Li
Yang Liu
Zhifei Zhang
Zhaowen Wang
Wei Xiong
Xiang Bai
DiffM
36
2
0
31 Jul 2024
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shiping Liu
Kecheng Zheng
Wei Chen
MLLM
41
33
0
31 Jul 2024