Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

17 October 2023

Jianwei Yang

Papers citing "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"

50 / 168 papers shown

Environmental Injection Attacks against GUI Agents in Realistic Dynamic Environments

122

03 Feb 2026

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

352

26 Nov 2025

MarketGen: A Scalable Simulation Platform with Auto-Generated Embodied Supermarket Environments

171

26 Nov 2025

OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

121

25 Nov 2025

Computer-Use Agents as Judges for Generative User Interface

104

19 Nov 2025

Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

199

19 Nov 2025

ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning

144

17 Nov 2025

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Georgios Pantazopoulos

Eda B. Özyiğit

LRM

356

11 Nov 2025

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

...

Jiang Wu

Qian Yu

Conghui He

217

04 Nov 2025

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

191

30 Oct 2025

Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

470

28 Oct 2025

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

372

27 Oct 2025

Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

133

27 Oct 2025

GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

203

25 Oct 2025

LightAgent: Mobile Agentic Foundation Models

Yangqin Jiang

Chao Huang

LLMAG

120

24 Oct 2025

Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

...

130

24 Oct 2025

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

132

03 Oct 2025

Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness

Vidhisha Balachandran

Besmira Nushi

Vibhav Vineet

148

02 Oct 2025

PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

166

01 Oct 2025

WALT: Web Agents that Learn Tools

Krithika Ramakrishnan

...

124

01 Oct 2025

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

...

243

01 Oct 2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

...

171

30 Sep 2025

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

137

30 Sep 2025

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

110

30 Sep 2025

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

222

30 Sep 2025

SCUBA: Salesforce Computer Use Benchmark

Yutong Dai

Krithika Ramakrishnan

...

196

30 Sep 2025

IA-VLA: Input Augmentation for Vision-Language-Action models in settings with semantically complex tasks

29 Sep 2025

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

153

29 Sep 2025

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

146

26 Sep 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

392

22 Sep 2025

3D Aware Region Prompted Vision Language Model

...

139

16 Sep 2025

Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition

119

15 Sep 2025

Embodied Navigation Foundation Model

...

420

15 Sep 2025

GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration

14 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

324

12 Sep 2025

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

145

10 Sep 2025

RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation

151

10 Sep 2025

AI Agents for Web Testing: A Case Study in the Wild

141

05 Sep 2025

Guideline-Consistent Segmentation via Multi-Agent Refinement

233

04 Sep 2025

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

130

03 Sep 2025

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

141

02 Sep 2025

Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation

124

01 Sep 2025

NetGent: Agent-Based Automation of Network Application Workflows

Jaber Daneshamooz

Eugene Vuong

Laasya Koduru

Sanjay Chandrasekaran

Arpit Gupta

112

30 Aug 2025

UItron: Foundational GUI Agent with Advanced Perception and Planning

196

29 Aug 2025

VoCap: Video Object Captioning and Segmentation from Any Prompt

261

29 Aug 2025

FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

130

12 Aug 2025

Large Language Models for Power System Security: A Novel Multi-Modal Approach for Anomaly Detection in Energy Management Systems

IEEE Access, 2025

137

12 Aug 2025

Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement

Chao Hao

Shuai Wang

Kaiwen Zhou

208

06 Aug 2025

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

...

244

06 Aug 2025

Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

246

01 Aug 2025