Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

17 October 2023

Jianwei Yang

Papers citing "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"

Showing 50 of 168 citing papers

MarketGen: A Scalable Simulation Platform with Auto-Generated Embodied Supermarket Environments

173

26 Nov 2025

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

352

26 Nov 2025

OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

125

25 Nov 2025

Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

201

19 Nov 2025

Computer-Use Agents as Judges for Generative User Interface

107

19 Nov 2025

ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning

144

17 Nov 2025

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Georgios Pantazopoulos

Eda B. Özyiğit

LRM

361

11 Nov 2025

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

...

Jiang Wu

Qian Yu

Conghui He

217

04 Nov 2025

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

191

30 Oct 2025

Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

472

28 Oct 2025

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

133

27 Oct 2025

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

372

27 Oct 2025

GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

203

25 Oct 2025

LightAgent: Mobile Agentic Foundation Models

Yangqin Jiang

Chao Huang

LLMAG

120

24 Oct 2025

Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

...

130

24 Oct 2025

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

132

03 Oct 2025

Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness

Vidhisha Balachandran

Besmira Nushi

Vibhav Vineet

148

02 Oct 2025

PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

166

01 Oct 2025

WALT: Web Agents that Learn Tools

Krithika Ramakrishnan

...

125

01 Oct 2025

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

...

245

01 Oct 2025

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

137

30 Sep 2025

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

111

30 Sep 2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

...

171

30 Sep 2025

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

228

30 Sep 2025

SCUBA: Salesforce Computer Use Benchmark

Yutong Dai

Krithika Ramakrishnan

...

196

30 Sep 2025

IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

29 Sep 2025

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

153

29 Sep 2025

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

146

26 Sep 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

392

22 Sep 2025

3D Aware Region Prompted Vision Language Model

...

139

16 Sep 2025

Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition

119

15 Sep 2025

Embodied Navigation Foundation Model

...

422

15 Sep 2025

GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration

14 Sep 2025

Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments

122

14 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

326

12 Sep 2025

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

145

10 Sep 2025

RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation

151

10 Sep 2025

AI Agents for Web Testing: A Case Study in the Wild

142

05 Sep 2025

Guideline-Consistent Segmentation via Multi-Agent Refinement

234

04 Sep 2025

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

130

03 Sep 2025

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

141

02 Sep 2025

Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation

124

01 Sep 2025

NetGent: Agent-Based Automation of Network Application Workflows

Jaber Daneshamooz

Eugene Vuong

Laasya Koduru

Sanjay Chandrasekaran

Arpit Gupta

112

30 Aug 2025

UItron: Foundational GUI Agent with Advanced Perception and Planning

196

29 Aug 2025

VoCap: Video Object Captioning and Segmentation from Any Prompt

261

29 Aug 2025

FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

130

12 Aug 2025

Large Language Models for Power System Security: A Novel Multi-Modal Approach for Anomaly Detection in Energy Management Systems

IEEE Access, 2025

137

12 Aug 2025

Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement

Chao Hao

Shuai Wang

Kaiwen Zhou

208

06 Aug 2025

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

...

244

06 Aug 2025

Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

246

01 Aug 2025