ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.06607
  4. Cited By
Monkey: Image Resolution and Text Label Are Important Things for Large
  Multi-modal Models

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

11 November 2023
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
    MLLM
ArXivPDFHTML

Papers citing "Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models"

50 / 200 papers shown
Title
What matters when building vision-language models?
What matters when building vision-language models?
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
30
155
0
03 May 2024
VimTS: A Unified Video and Image Text Spotter for Enhancing the
  Cross-domain Generalization
VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
Yuliang Liu
Mingxin Huang
Hao Yan
Linger Deng
Weijia Wu
Hao Lu
Chunhua Shen
Lianwen Jin
Xiang Bai
25
0
0
30 Apr 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
  Models with Open-Source Suites
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
49
513
0
25 Apr 2024
Interpreting COVID Lateral Flow Tests' Results with Foundation Models
Interpreting COVID Lateral Flow Tests' Results with Foundation Models
Stuti Pandey
Josh Myers-Dean
Jarek Reynolds
Danna Gurari
26
0
0
21 Apr 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Hao Liu
Yuan Xie
Xiang Bai
Can Huang
LRM
VLM
MLLM
61
28
0
19 Apr 2024
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning
Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning
Yian Li
Wentao Tian
Yang Jiao
Jingjing Chen
Yueping Jiang
Bin Zhu
Na Zhao
Yu-Gang Jiang
LRM
35
9
0
19 Apr 2024
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
Bozhi Luan
Hao Feng
Hong Chen
Yonghui Wang
Wen-gang Zhou
Houqiang Li
MLLM
24
10
0
15 Apr 2024
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
  Large Language Models
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu
Minghui Liao
Jihao Wu
Yongxin Liao
Xiaoyu Zheng
Wei Zeng
VLM
19
15
0
14 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
HRVDA: High-Resolution Visual Document Assistant
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
35
23
0
10 Apr 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
  Handling Resolutions from 336 Pixels to 4K HD
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Xingcheng Zhang
Jifeng Dai
Yuxin Qiao
Dahua Lin
Jiaqi Wang
VLM
MLLM
31
111
0
09 Apr 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You
Haotian Zhang
E. Schoop
Floris Weers
Amanda Swearngin
Jeffrey Nichols
Yinfei Yang
Zhe Gan
MLLM
39
82
0
08 Apr 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Yuhang Zang
...
Haodong Duan
Jiaqi Wang
Yu Qiao
Dahua Lin
Feng Zhao
VLM
61
216
0
29 Mar 2024
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language
  Models
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
Zuyan Liu
Yuhao Dong
Yongming Rao
Jie Zhou
Jiwen Lu
LRM
17
12
0
19 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
  Understanding
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
45
104
0
19 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
21
102
0
18 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
27
185
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling
  and Visual-Language Co-Referring
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
31
12
0
14 Mar 2024
Large Language Models and Causal Inference in Collaboration: A Survey
Large Language Models and Causal Inference in Collaboration: A Survey
Xiaoyu Liu
Paiheng Xu
Junda Wu
Jiaxin Yuan
Yifan Yang
...
Haoliang Wang
Tong Yu
Julian McAuley
Wei Ai
Furong Huang
ELM
LRM
72
35
0
14 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
  Acceleration for Large Vision-Language Models
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM
VLM
48
114
0
11 Mar 2024
CAT: Enhancing Multimodal Large Language Model to Answer Questions in
  Dynamic Audio-Visual Scenarios
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye
Zitong Yu
Rui Shao
Xinyu Xie
Philip H. S. Torr
Xiaochun Cao
MLLM
33
24
0
07 Mar 2024
TextMonkey: An OCR-Free Large Multimodal Model for Understanding
  Document
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu
Biao Yang
Qiang Liu
Zhang Li
Zhiyin Ma
Shuo Zhang
Xiang Bai
MLLM
VLM
36
87
0
07 Mar 2024
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang
Jihao Wu
Yihua Teng
Minghui Liao
Nuo Xu
Xiao Xiao
Zhongyu Wei
Duyu Tang
LLMAG
LM&Ro
27
50
0
05 Mar 2024
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Yao Mu
Junting Chen
Qinglong Zhang
Shoufa Chen
Qiaojun Yu
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Mingyu Ding
Ping Luo
37
20
0
25 Feb 2024
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large
  Language Models
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
Yuhang Cao
Pan Zhang
Xiao-wen Dong
Dahua Lin
Jiaqi Wang
32
10
0
22 Feb 2024
Uncertainty-Aware Evaluation for Vision-Language Models
Uncertainty-Aware Evaluation for Vision-Language Models
Vasily Kostumov
Bulat Nutfullin
Oleg Pilipenko
Eugene Ilyushin
ELM
40
7
0
22 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRM
VLM
46
41
0
19 Feb 2024
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
Renqiu Xia
Bo-Wen Zhang
Hancheng Ye
Xiangchao Yan
Qi Liu
...
Min Dou
Botian Shi
Junchi Yan
Junchi Yan
Yu Qiao
LRM
53
52
0
19 Feb 2024
CoLLaVO: Crayon Large Language and Vision mOdel
CoLLaVO: Crayon Large Language and Vision mOdel
Byung-Kwan Lee
Beomchan Park
Chae Won Kim
Yonghyun Ro
VLM
MLLM
19
16
0
17 Feb 2024
ScreenAgent: A Vision Language Model-driven Computer Control Agent
ScreenAgent: A Vision Language Model-driven Computer Control Agent
Runliang Niu
Jindong Li
Shiqi Wang
Yali Fu
Xiyu Hu
Xueyuan Leng
He Kong
Yi Chang
Qi Wang
LLMAG
MLLM
LM&Ro
55
37
0
09 Feb 2024
A Survey on Hallucination in Large Vision-Language Models
A Survey on Hallucination in Large Vision-Language Models
Hanchao Liu
Wenyuan Xue
Yifei Chen
Dapeng Chen
Xiutian Zhao
Ke Wang
Liping Hou
Rong-Zhi Li
Wei Peng
LRM
MLLM
14
110
0
01 Feb 2024
MouSi: Poly-Visual-Expert Vision-Language Models
MouSi: Poly-Visual-Expert Vision-Language Models
Xiaoran Fan
Tao Ji
Changhao Jiang
Shuo Li
Senjie Jin
...
Qi Zhang
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yunchun Jiang
VLM
21
16
0
30 Jan 2024
InternLM-XComposer2: Mastering Free-form Text-Image Composition and
  Comprehension in Vision-Language Large Model
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Conghui He
Xingcheng Zhang
Yu Qiao
Dahua Lin
Jiaqi Wang
VLM
MLLM
76
242
0
29 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
37
173
0
24 Jan 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
156
895
0
21 Dec 2023
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap
  Between Text-to-2D and Text-to-3D Generation
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
Yiwei Ma
Yijun Fan
Jiayi Ji
Haowei Wang
Xiaoshuai Sun
Guannan Jiang
Annan Shu
Rongrong Ji
14
7
0
30 Nov 2023
Towards Improving Document Understanding: An Exploration on
  Text-Grounding via MLLMs
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
Yonghui Wang
Wen-gang Zhou
Hao Feng
Keyi Zhou
Houqiang Li
50
18
0
22 Nov 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the
  Frequency Domain for Versatile Document Understanding
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
20
58
0
20 Nov 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with
  Modality Collaboration
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye
Haiyang Xu
Jiabo Ye
Mingshi Yan
Anwen Hu
Haowei Liu
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM
VLM
116
367
0
07 Nov 2023
Woodpecker: Hallucination Correction for Multimodal Large Language
  Models
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Tong Bill Xu
Hao Wang
Dianbo Sui
Yunhang Shen
Ke Li
Xingguo Sun
Enhong Chen
VLM
MLLM
30
112
0
24 Oct 2023
Kosmos-2.5: A Multimodal Literate Model
Kosmos-2.5: A Multimodal Literate Model
Tengchao Lv
Yupan Huang
Jingye Chen
Lei Cui
Shuming Ma
...
Weiyao Luo
Shaoxiang Wu
Guoxin Wang
Cha Zhang
Furu Wei
VLM
MLLM
21
63
0
20 Sep 2023
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Bo-wen Li
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
19
898
0
12 Jul 2023
A Survey on Multimodal Large Language Models
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Bill Xu
Enhong Chen
MLLM
LRM
33
551
0
23 Jun 2023
On the Hidden Mystery of OCR in Large Multimodal Models
On the Hidden Mystery of OCR in Large Multimodal Models
Yuliang Liu
Zhang Li
Mingxin Huang
Chunyuan Li
Dezhi Peng
Mingyu Liu
Lianwen Jin
Xiang Bai
VLM
MLLM
18
47
0
13 May 2023
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets
  Prompt Engineering
A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering
Chaoning Zhang
Fachrina Dewi Puspitasari
Sheng Zheng
Chenghao Li
Yu Qiao
...
Caiyan Qin
François Rameau
Lik-Hang Lee
Sung-Ho Bae
Choong Seon Hong
VLM
76
61
0
12 May 2023
LMEye: An Interactive Perception Network for Large Language Models
LMEye: An Interactive Perception Network for Large Language Models
Yunxin Li
Baotian Hu
Xinyu Chen
Lin Ma
Yong-mei Xu
M. Zhang
MLLM
VLM
25
24
0
05 May 2023
mPLUG-Owl: Modularization Empowers Large Language Models with
  Multimodality
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
203
883
0
27 Apr 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language
  Understanding
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
154
259
0
07 Oct 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science
  Question Answering
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
A. Kalyan
ELM
ReLM
LRM
207
1,089
0
20 Sep 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified
  Vision-Language Understanding and Generation
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
385
4,010
0
28 Jan 2022
Previous
1234