Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

17 October 2023

Jianwei Yang

Papers citing "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"

Showing 50 of 168 citing papers

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

569

15 Apr 2025

RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

362

14 Apr 2025

GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation

532

13 Apr 2025

Domain-Conditioned Scene Graphs for State-Grounded Task Planning

319

09 Apr 2025

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

436

01 Apr 2025

A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

...

622

30 Mar 2025

Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study

227

21 Mar 2025

M3: 3D-Spatial MultiModal Memory

International Conference on Learning Representations (ICLR), 2025

261

20 Mar 2025

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

...

1.2K

19 Mar 2025

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

348

18 Mar 2025

DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents

360

14 Mar 2025

IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Yiyang Ling

Karan Owalekar

Oluwatobiloba Adesanya

Erdem Bıyık

Daniel Seita

346

13 Mar 2025

Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

1.1K

13 Mar 2025

In-Context Defense in Computer Agents: An Empirical Study

334

12 Mar 2025

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

456

12 Mar 2025

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

491

10 Mar 2025

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

455

03 Mar 2025

Introducing Visual Perception Token into Multimodal Large Language Model

334

24 Feb 2025

Programming with Pixels: Can Computer-Use Agents do Software Engineering?

Pranjal Aggarwal

Sean Welleck

363

24 Feb 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

409

19 Feb 2025

Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks

281

18 Feb 2025

Magma: A Foundation Model for Multimodal AI Agents

Computer Vision and Pattern Recognition (CVPR), 2025

...

368

18 Feb 2025

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

...

459

18 Feb 2025

Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

867

17 Feb 2025

Digi-Q: Learning Q-Value Functions for Training Device-Control Agents

318

13 Feb 2025

Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling

771

04 Feb 2025

Embodied Scene Understanding for Vision Language Models via MetaVQA

Computer Vision and Pattern Recognition (CVPR), 2025

327

17 Jan 2025

Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis

ACM Transactions on Knowledge Discovery from Data (TKDD), 2024

454

17 Jan 2025

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

769

02 Jan 2025

From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models

Leslie Pack Kaelbling

Leslie Kaelbling

LM&Ro

344

31 Dec 2024

Aria-UI: Visual Grounding for GUI Instructions

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

510

20 Dec 2024

CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Dimitrios Mallis

Ahmet Serdar Karadeniz

Sebastian Cavada

Danila Rukhovich

Niki Maria Foteinopoulou

K. Cherenkova

Anis Kacem

Djamila Aouada

608

18 Dec 2024

RelationField: Relate Anything in Radiance Fields

Computer Vision and Pattern Recognition (CVPR), 2024

414

18 Dec 2024

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

...

757

100

18 Dec 2024

Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning

AAAI Conference on Artificial Intelligence (AAAI), 2024

305

14 Dec 2024

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles

...

2.0K

06 Dec 2024

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

562

27 Nov 2024

GUI Agents with Foundation Models: A Comprehensive Survey

...

Bin Wang

Chuhan Wu

Yasheng Wang

Ruiming Tang

Jianye Hao

LLMAG

487

07 Nov 2024

Attacking Vision-Language Computer Agents via Pop-ups

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

435

04 Nov 2024

SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

International Conference on Learning Representations (ICLR), 2024

...

Yasheng Wang

Jun Wang

Youssef Attia El Hili

LLMAG

522

19 Oct 2024

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

International Conference on Learning Representations (ICLR), 2024

442

11 Oct 2024

GSON: A Group-based Social Navigation Framework with Large Multimodal Model

IEEE Robotics and Automation Letters (RA-L), 2024

494

26 Sep 2024

FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

390

23 Sep 2024

Vision Language Models Can Parse Floor Plan Maps

362

19 Sep 2024

Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild

Knowledge-Based Systems (KBS), 2024

332

03 Sep 2024

EditScribe: Non-Visual Image Editing with Natural Language Verification Loops

International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2024

200

13 Aug 2024

VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments

IEEE Robotics and Automation Letters (RA-L), 2024

Daeun Song

Jing Liang

Xuesu Xiao

Dinesh Manocha

624

05 Aug 2024

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Jiayi Ji

333

31 Jul 2024

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

391

03 Jul 2024

Tree Search for Language Model Agents

421

120

01 Jul 2024