Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

17 October 2023
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chun-yue Li
Jianfeng Gao
MLLM, VLM
ArXiv (abs) | PDF | HTML | HuggingFace (28 upvotes) | GitHub (1387★)

Papers citing "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"

50 / 163 papers shown
MarketGen: A Scalable Simulation Platform with Auto-Generated Embodied Supermarket Environments
Xu Hu
Yiyang Feng
Junran Peng
Jiawei He
L. Chen
Chuanchen Luo
Xucheng Yin
Qing Li
Zhaoxiang Zhang
LM&Ro
96
0
0
26 Nov 2025
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu
Sudong Wang
Yao Zhu
Jianing Li
Yunjian Zhang
LRM
234
0
0
26 Nov 2025
OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Karen Ullrich
Jingtong Su
Claudia Shi
Arjun Subramonian
Amir Bar
Ivan Evtimov
Nikolaos Tsilivis
Randall Balestriero
Julia Kempe
Mark Ibrahim
64
0
0
25 Nov 2025
Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin
Siyuan Hu
Linjie Li
Zhengyuan Yang
Lijuan Wang
Philip Torr
Mike Zheng Shou
ELM
67
0
0
19 Nov 2025
Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Yifu Guo
Zishan Xu
Zhiyuan Yao
Y. Lu
Jiaye Lin
Sen Hu
Zhenheng Tang
Y. Li
Huacan Wang
LRM
125
0
0
19 Nov 2025
ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning
Juntao Jian
Yi-Lin Wei
Chengjie Mou
Yuhao Lin
Xing Zhu
Yujun Shen
Wei-Shi Zheng
Ruizhen Hu
106
0
0
17 Nov 2025
An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos
Eda B. Özyiğit
LRM
258
0
0
11 Nov 2025
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Jiahe Song
C. Wang
Bowen Jiang
Y Samuel Wang
Hao Zheng
...
Y. Wang
Lijun Wu
Jiang Wu
Qian Yu
Conghui He
88
0
0
04 Nov 2025
SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin
Rachneet Kaur
Zhen Zeng
Sumitra Ganesh
Srijan Kumar
138
0
0
30 Oct 2025
Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World
Yingzhao Jian
Zhongan Wang
Yi Yang
Hehe Fan
LM&Ro
352
0
0
28 Oct 2025
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
Shijian Wang
Jiarui Jin
Xingjian Wang
L. Song
Runhao Fu
H. Wang
Zongyuan Ge
Yuan Lu
Xuelian Cheng
ReLM, LRM
84
3
0
27 Oct 2025
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan
W. Zhang
Xin Li
Shihao Wang
Kehan Li
Wentong Li
Jun Xiao
Lei Zhang
Beng Chin Ooi
ObjD
286
0
0
27 Oct 2025
GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Karim Elmaaroufi
Liheng Lai
Justin Svegliato
Yutong Bai
Sanjit A. Seshia
Matei A. Zaharia
142
0
0
25 Oct 2025
LightAgent: Mobile Agentic Foundation Models
Yangqin Jiang
Chao Huang
LLMAG
98
0
0
24 Oct 2025
Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
Qixiu Li
Yu Deng
Yaobo Liang
L. Luo
Lei Zhou
...
Hao Chen
Lily Sun
Dong Chen
J. Yang
B. Guo
101
3
0
24 Oct 2025
FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents
Imene Kerboua
Sahar Omidi Shayegan
Megh Thakkar
Xing Han Lù
Léo Boisvert
Massimo Caccia
Jérémy Espinas
Alexandre Aussem
Véronique Eglin
Alexandre Lacoste
LLMAG
92
0
0
03 Oct 2025
Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
Erfan Shayegani
Keegan Hines
Yue Dong
Nael B. Abu-Ghazaleh
Roman Lutz
Spencer Whitehead
Vidhisha Balachandran
Besmira Nushi
Vibhav Vineet
108
0
0
02 Oct 2025
PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents
Zikang Liu
Junyi Li
Wayne Xin Zhao
Dawei Gao
Yaliang Li
Ji-Rong Wen
LLMAG
98
2
0
01 Oct 2025
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Yurun Chen
Xavier Hu
Y. Liu
Ziqi Wang
Zeyi Liao
...
Feng Wei
Yuxi Qian
Bo Zheng
Keting Yin
Shengyu Zhang
LLMAG
189
1
0
01 Oct 2025
WALT: Web Agents that Learn Tools
Viraj Prabhu
Yutong Dai
M. Fernández
Jing Gu
Krithika Ramakrishnan
...
Silvio Savarese
Caiming Xiong
Junnan Li
Zeyuan Chen
Ran Xu
LLMAG, CLL, KELM
90
0
0
01 Oct 2025
SCUBA: Salesforce Computer Use Benchmark
Yutong Dai
Krithika Ramakrishnan
Jing Gu
M. Fernández
Yanqi Luo
...
Zhenyu Hu
Silvio Savarese
Caiming Xiong
Zeyuan Chen
Ran Xu
ELM
123
1
0
30 Sep 2025
SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval
Ren-Di Wu
Yu-Yen Lin
Huei-Fang Yang
VLM
63
0
0
30 Sep 2025
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Zhen Yang
Zi-Yi Dou
Di Feng
Forrest Huang
Anh Nguyen
...
Chao Jia
Jeffrey Nichols
Alexander Toshev
Yinfei Yang
Zhe Gan
LLMAG
91
2
0
30 Sep 2025
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu
H. Shen
Chunxin Fang
Zhicheng Sun
Jiajia Liao
T. Zhao
MLLM, ObjD, VLM, LRM
165
2
0
30 Sep 2025
DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Chi Zhang
Haibo Qiu
Qiming Zhang
Zhixiong Zeng
Lin Ma
Jing Zhang
VGen, LRM
65
4
0
30 Sep 2025
IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks
Eric Hannus
Miika Malin
Tran Minh Son Le
Ville Kyrki
VLM
56
0
0
29 Sep 2025
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
Shuoshuo Zhang
Zijian Li
Yizhen Zhang
Jingjing Fu
Lei Song
Jiang Bian
Jun Zhang
Y. Yang
Rui Wang
LRM
80
1
0
29 Sep 2025
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
Zejun Li
Yingxiu Zhao
Jiwen Zhang
Siyuan Wang
Yang Yao
Runzhou Zhao
Jun Song
Bo Zheng
Zhongyu Wei
LRM
66
0
0
26 Sep 2025
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Ye Liu
Zongyang Ma
Junfu Pu
Zhongang Qi
Yang Wu
Mingyu Ding
Chang Wen Chen
MLLM, ObjD, LRM
263
2
0
22 Sep 2025
3D Aware Region Prompted Vision Language Model
A. Cheng
Yang Fu
Yukang Chen
Zhijian Liu
X. Li
...
Jan Kautz
Pavlo Molchanov
Hongxu Yin
Xiaolong Wang
Sifei Liu
103
6
0
16 Sep 2025
Embodied Navigation Foundation Model
Jiazhao Zhang
Anqi Li
Yunpeng Qi
Minghan Li
Jiahang Liu
...
Zhibo Chen
Fei Gao
Qi Wu
Zhizheng Zhang
He Wang
LM&Ro
263
7
0
15 Sep 2025
Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
Danielle Cohen
Yoni Halpern
Noam Kahlon
Joel Oren
Omri Berkovitch
Sapir Caduri
Ido Dagan
Anatoly Efros
72
0
0
15 Sep 2025
GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration
Wan Xu
Feng Zhu
Yihan Zeng
Yuanfan Guo
Ming-Yu Liu
Hang Xu
W. Zuo
48
0
0
14 Sep 2025
Realistic Environmental Injection Attacks on GUI Agents
Yitong Zhang
Ximo Li
L. Cai
Jia Li
LLMAG, AAML
77
2
0
14 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
240
1
0
12 Sep 2025
RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation
Z. Zhang
Chenghao Yue
Haobo Xu
Minwen Liao
Xianglin Qi
Huan-ang Gao
Ziwei Wang
Hao Zhao
116
1
0
10 Sep 2025
SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation
M. Munje
Chen Tang
Shuijing Liu
Zichao Hu
Yifeng Zhu
Jiaxun Cui
Garrett A. Warnell
Joydeep Biswas
Peter Stone
96
2
0
10 Sep 2025
AI Agents for Web Testing: A Case Study in the Wild
Naimeng Ye
Xiao Yu
Ruize Xu
Tianyi Peng
Zhou Yu
LLMAG
84
0
0
05 Sep 2025
Guideline-Consistent Segmentation via Multi-Agent Refinement
Vanshika Vats
Ashwani Rathee
James Davis
VLM
160
0
0
04 Sep 2025
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou
Xiangyu Peng
Shrikant B. Kendre
Michael S Ryoo
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
84
0
0
03 Sep 2025
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
Longrong Yang
Zhixiong Zeng
Yufeng Zhong
Jing Huang
Liming Zheng
Lei Chen
Haibo Qiu
Zequn Qin
Lin Ma
Xi Li
LLMAG, LM&Ro
75
2
0
02 Sep 2025
Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation
Maelic Neau
Zoe Falomir
Cédric Buche
Akihiro Sugimoto
64
0
0
01 Sep 2025
NetGent: Agent-Based Automation of Network Application Workflows
Jaber Daneshamooz
Eugene Vuong
Laasya Koduru
Sanjay Chandrasekaran
Arpit Gupta
72
1
0
30 Aug 2025
UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng
Jing Huang
Liming Zheng
Wenkang Han
Yufeng Zhong
Lei Chen
Longrong Yang
Yingjie Chu
Yuzhi He
Lin Ma
LLMAG
165
4
0
29 Aug 2025
VoCap: Video Object Captioning and Segmentation from Any Prompt
J. Uijlings
Xingyi Zhou
Xiuye Gu
Arsha Nagrani
Anurag Arnab
Alireza Fathi
David A. Ross
Cordelia Schmid
VOS, VLM
188
1
0
29 Aug 2025
FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
Fengxian Ji
Jingpu Yang
Zirui Song
Yuanxi Wang
Zhexuan Cui
Yuke Li
Qian Jiang
Miao Fang
Xiuying Chen
LLMAG
76
0
0
12 Aug 2025
Large Language Models for Power System Security: A Novel Multi-Modal Approach for Anomaly Detection in Energy Management Systems
Aydin Zaboli
Junho Hong
Alexandru Stefanov
Chen-Ching Liu
Chul-Sang Hwang
AAML
84
0
0
12 Aug 2025
Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement
Chao Hao
Shuai Wang
Kaiwen Zhou
138
7
0
06 Aug 2025
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
Xueyu Hu
Tao Xiong
Biao Yi
Zishu Wei
Ruixuan Xiao
...
Zhou Zhao
Hongxia Yang
Fan Wu
Shengyu Zhang
Fei Wu
LLMAG, LM&Ro, AI4TS
198
29
0
06 Aug 2025
Decouple before Align: Visual Disentanglement Enhances Prompt Tuning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Fei Zhang
Tianfei Zhou
Jiangchao Yao
Ya Zhang
Ivor W. Tsang
Yanfeng Wang
176
5
0
01 Aug 2025