v1, v2 (latest)

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

17 October 2023

Jianwei Yang

ArXiv (abs) | PDF | HTML | HuggingFace (28 upvotes) | GitHub (1387★)

Papers citing "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"

50 / 168 papers shown

Accessibility Scout: Personalized Accessibility Scans of Built Environments

ACM Symposium on User Interface Software and Technology (UIST), 2025

194

31 Jul 2025

Magentic-UI: Towards Human-in-the-loop Agentic Systems

...

224

30 Jul 2025

MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation

409

29 Jul 2025

Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models

256

26 Jul 2025

Object-centric Video Question Answering with Visual Grounding and Referring

267

25 Jul 2025

MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning

...

435

19 Jul 2025

WebGuard: Building a Generalizable Guardrail for Web Agents

...

177

18 Jul 2025

ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way

251

11 Jul 2025

3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds

...

222

09 Jul 2025

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

217

02 Jul 2025

GenFlow: Interactive Modular System for Image Generation

189

26 Jun 2025

GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System

238

23 Jun 2025

AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making

323

14 Jun 2025

Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

230

10 Jun 2025

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu

Yinzhu Quan

217

09 Jun 2025

Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

206

08 Jun 2025

Contextual Experience Replay for Self-Improvement of Language Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

288

07 Jun 2025

MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?

241

06 Jun 2025

Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction

657

05 Jun 2025

A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

380

04 Jun 2025

Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

464

04 Jun 2025

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

557

04 Jun 2025

Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

...

202

03 Jun 2025

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

372

03 Jun 2025

Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

International Conference on Information Photonics (ICIP), 2025

315

30 May 2025

Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

282

27 May 2025

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

...

527

26 May 2025

Robot Operation of Home Appliances by Reading User Manuals

350

26 May 2025

ChartLens: Fine-grained Visual Attribution in Charts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

212

25 May 2025

LA-RCS: LLM-Agent-Based Robot Control System

244

23 May 2025

InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

203

23 May 2025

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

330

22 May 2025

Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

419

22 May 2025

Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey

274

20 May 2025

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Computer Vision and Pattern Recognition (CVPR), 2025

241

19 May 2025

Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

356

16 May 2025

Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis

343

15 May 2025

ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

376

14 May 2025

Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

Benjamin Raphael Ernhofer

Daniil Prokhorov

Jannica Langner

Dominik Bollmann

337

09 May 2025

EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation

1.2K

08 May 2025

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding

567

08 May 2025

RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Ruiqi Wang

Hao Zhang

VLM

282

03 May 2025

Physics-Constrained Robot Grasp Planning for Dynamic Tool Use

Noah Trupin

Zixing Wang

A. H. Qureshi

288

02 May 2025

Robotic Visual Instruction

Computer Vision and Pattern Recognition (CVPR), 2025

407

01 May 2025

Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

512

22 Apr 2025

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

515

22 Apr 2025

DRAWER: Digital Reconstruction and Articulation With Environment Realism

Computer Vision and Pattern Recognition (CVPR), 2025

478

21 Apr 2025

UFO2: The Desktop AgentOS

...

721

20 Apr 2025

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Krishna Murthy Jatavallabhula

...

287

19 Apr 2025

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

...

632

17 Apr 2025