2310.11441
Cited By
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
17 October 2023
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chunyuan Li
Jianfeng Gao
MLLM
VLM
HuggingFace (28 upvotes)
Github (1387★)
Papers citing
"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"
50 / 163 papers shown
Accessibility Scout: Personalized Accessibility Scans of Built Environments
ACM Symposium on User Interface Software and Technology (UIST), 2025
William Huang
Xia Su
Jon E. Froehlich
Yang Zhang
123
1
0
31 Jul 2025
Magentic-UI: Towards Human-in-the-loop Agentic Systems
Hussein Mozannar
Gagan Bansal
Cheng Tan
Adam Fourney
Victor C. Dibia
...
Friederike Niedtner
Ece Kamar
Maya Murad
Rafah Hosn
Saleema Amershi
LLMAG
LM&Ro
138
13
0
30 Jul 2025
MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation
Yi Kong
Dianxi Shi
Guoli Yang
Zhang ke-di
Chenlin Huang
Xiaopeng Li
Songchang Jin
LLMAG
LM&Ro
321
3
0
29 Jul 2025
Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models
Anjali R. Menon
Rohit K. Sharma
Priya Singh
Chengyu Wang
Aurora M. Ferreira
Mateja Novak
LLMAG
LM&Ro
AI4CE
110
0
0
26 Jul 2025
Object-centric Video Question Answering with Visual Grounding and Referring
Haochen Wang
Qirui Chen
Cilin Yan
Jiayin Cai
Xiaolong Jiang
Yao Hu
Weidi Xie
Stratis Gavves
MLLM
VOS
192
4
0
25 Jul 2025
MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning
Liujian Tang
Shaokang Dong
Y. Huang
Minqi Xiang
Hongtao Ruan
...
Qi Zhang
Kang Wang
Y. Zhang
Y. Wang
Yuran Wang
LM&Ro
341
6
0
19 Jul 2025
WebGuard: Building a Generalizable Guardrail for Web Agents
Boyuan Zheng
Zeyi Liao
Scott Salisbury
Zeyuan Liu
Michael Lin
...
Zifan Wang
Xiang Deng
Dawn Song
Huan Sun
Eric Fosler-Lussier
LLMAG
148
4
0
18 Jul 2025
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Rajarshi Roy
Devleena Das
A. Banerjee
Arjya Bhattacharjee
Kousik Dasgupta
Subarna Tripathi
VLM
200
0
0
11 Jul 2025
3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
Fan-Yun Sun
Shengguang Wu
Christian Jacobsen
Thomas Yim
H. Zou
...
Valts Blukis
Jonathan Tremblay
Jiajun Wu
Stan Birchfield
Nick Haber
VGen
158
0
0
09 Jul 2025
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Rahul Ramachandran
Ali Garjani
Roman Bachmann
Andrei Atanov
Oğuzhan Fatih Kar
Amir Zamir
MLLM
VLM
LRM
186
10
0
02 Jul 2025
GenFlow: Interactive Modular System for Image Generation
Duc-Tien Dang-Nguyen
Huu-Phuc Huynh
Minh-Triet Tran
T. Le
138
0
0
26 Jun 2025
GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System
Quang H. Nguyen
T. H. Le
Huy Le Nguyen
T. Vo
Tung D. Ta
Baoru Huang
Minh Nhat Vu
Anh-Tien Nguyen
175
0
0
23 Jun 2025
AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making
Wenbo Li
Shiyi Wang
Yiteng Chen
Huiping Zhuang
Qingyao Wu
235
0
0
14 Jun 2025
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
Yuan Guo
Tingjia Miao
Zheng Wu
Pengzhou Cheng
Ming Zhou
Zhuosheng Zhang
174
6
0
10 Jun 2025
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu
Yinzhu Quan
154
4
0
09 Jun 2025
Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
Tianyi Bai
Zengjie Hu
Fupeng Sun
Jiantao Qiu
Yizhen Jiang
Guangxin He
Bohan Zeng
Conghui He
Binhang Yuan
Wentao Zhang
OffRL
LRM
135
10
0
08 Jun 2025
Contextual Experience Replay for Self-Improvement of Language Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yitao Liu
Chenglei Si
Karthik Narasimhan
Shunyu Yao
LLMAG
196
9
0
07 Jun 2025
MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?
Zhitao He
Zongwei Lyu
Dazhong Chen
Dadi Guo
Yi R. Fung
LRM
196
5
0
06 Jun 2025
Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction
Zesheng Ye
C. Cai
Ruijiang Dong
Jianzhong Qi
Bingquan Shen
Pin-Yu Chen
Feng Liu
515
1
0
05 Jun 2025
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
Pei Yang
Hai Ci
Mike Zheng Shou
LLMAG
317
4
0
04 Jun 2025
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Fangrui Zhu
Hanhui Wang
Yiming Xie
Jing Gu
Tianye Ding
Jianwei Yang
Huaizu Jiang
3DV
LRM
364
0
0
04 Jun 2025
A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhiyu Zhang
Wei Chen
Youfang Lin
Huaiyu Wan
OffRL
CLL
343
1
0
04 Jun 2025
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia
Zekun Qi
Shaochen Zhang
Wenyao Zhang
Xinqiang Yu
Jiawei He
He Wang
L. Yi
LRM
VLM
230
23
0
03 Jun 2025
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights
M. Andreux
Breno Baldas Skuk
Hamza Benchekroun
Emilien Biré
Antoine Bonnet
...
Marc Thibault
L. Thiry
Léo Tronchon
Nicolas Usunier
Tony Wu
LLMAG
172
0
0
03 Jun 2025
Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering
IEEE International Conference on Image Processing (ICIP), 2025
Md Intisar Chowdhury
Kittinun Aukkapinyo
Hiroshi Fujimura
Joo Ann Woo
Wasu Wasusatein
Fadoua Ghourabi
226
0
0
30 May 2025
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Shurong Zheng
Fan Yang
Ming Tang
Jinqiao Wang
VLM
LRM
239
1
0
27 May 2025
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Qiushi Sun
Zhoumianze Liu
Chang Ma
Zichen Ding
Fangzhi Xu
...
B. Kao
Wenhai Wang
Biqing Qi
Lingpeng Kong
Zhiyong Wu
LLMAG
LM&Ro
444
11
0
26 May 2025
Robot Operation of Home Appliances by Reading User Manuals
Jian Zhang
Hanbo Zhang
Anxing Xiao
David Hsu
LM&Ro
255
1
0
26 May 2025
ChartLens: Fine-grained Visual Attribution in Charts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Manan Suri
Puneet Mathur
Nedim Lipka
Franck Dernoncourt
Ryan Rossi
Dinesh Manocha
164
1
0
25 May 2025
LA-RCS: LLM-Agent-Based Robot Control System
TaekHyun Park
YoungJun Choi
SeungHoon Shin
Kwangil Lee
173
0
0
23 May 2025
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zifu Wan
Yaqi Xie
Ce Zhang
Zhiqiu Lin
Zihan Wang
Simon Stepputtis
Deva Ramanan
Katia Sycara
159
3
0
23 May 2025
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Bin Xie
Rui Shao
Gongwei Chen
Kaiwen Zhou
Yinchuan Li
Jie Liu
Min Zhang
Liqiang Nie
LLMAG
233
13
0
22 May 2025
Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach
Xiaoran Yin
Xu Luo
Hao Wu
Lianli Gao
Jingkuan Song
311
1
0
22 May 2025
Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey
Seunghyuk Cho
Zhenyue Qin
Yang Liu
Youngbin Choi
Seungbeom Lee
Dongwoo Kim
LRM
234
2
0
20 May 2025
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Computer Vision and Pattern Recognition (CVPR), 2025
Yunseok Jang
Yeda Song
Sungryull Sohn
Lajanugen Logeswaran
Tiange Luo
Dong-Ki Kim
Kyunghoon Bae
Honglak Lee
VGen
201
3
0
19 May 2025
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shun Inadumi
Nobuhiro Ueda
Koichiro Yoshino
ObjD
304
0
0
16 May 2025
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
Pengfei Wang
Guohai Xu
Weinong Wang
Junjie Yang
Jie Lou
Yunhua Xue
294
1
0
15 May 2025
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation
Enyu Zhao
Vedant Raval
Hejia Zhang
Jiageng Mao
Zeyu Shangguan
Stefanos Nikolaidis
Yun Wang
Daniel Seita
LM&Ro
CoGe
278
10
0
14 May 2025
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
Benjamin Raphael Ernhofer
Daniil Prokhorov
Jannica Langner
Dominik Bollmann
241
1
0
09 May 2025
EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Biao Yi
Xavier Hu
Yexin Chen
Shengyu Zhang
Hongxia Yang
Fan Wu
LLMAG
1.0K
3
0
08 May 2025
Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
Jaehyun Jeon
Janghan Yoon
Minsoo Kim
Sumin Shim
Yejin Choi
Hanbin Kim
Youngjae Yu
AAML
491
0
0
08 May 2025
RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Ruiqi Wang
Hao Zhang
VLM
229
2
0
03 May 2025
Physics-Constrained Robot Grasp Planning for Dynamic Tool Use
Noah Trupin
Zixing Wang
A. H. Qureshi
225
0
0
02 May 2025
Robotic Visual Instruction
Computer Vision and Pattern Recognition (CVPR), 2025
Yuchen Ren
Ziyang Gong
Haoyang Li
Xiaoqi Huang
Haolan Kang
Guangping Bai
Xianzheng Ma
LM&Ro
333
7
0
01 May 2025
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Ivan Evtimov
Arman Zharmagambetov
Aaron Grattafiori
Chuan Guo
Kamalika Chaudhuri
ELM
359
35
0
22 Apr 2025
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Ziqiao Ma
Jing Ding
Xuejun Zhang
Dezhi Luo
Jiahe Ding
Sihan Xu
Yuchen Huang
Run Peng
Joyce Chai
420
3
0
22 Apr 2025
DRAWER: Digital Reconstruction and Articulation With Environment Realism
Computer Vision and Pattern Recognition (CVPR), 2025
Hongchi Xia
Entong Su
Marius Memmel
Arhan Jain
Raymond Yu
Numfor Mbiziwo-Tiapo
Ali Farhadi
Abhishek Gupta
Shenlong Wang
Wei-Chiu Ma
VGen
359
10
0
21 Apr 2025
UFO2: The Desktop AgentOS
Chaoyun Zhang
He Huang
Chiming Ni
J. Mu
Si Qin
...
Minghua Ma
Jian-Guang Lou
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
LLMAG
605
13
0
20 Apr 2025
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Sergio Arnaud
Paul Mcvay
Ada Martin
Arjun Majumdar
Krishna Murthy Jatavallabhula
...
Nicolas Ballas
Mido Assran
Oleksandr Maksymets
Aravind Rajeswaran
Franziska Meier
3DPC
236
15
0
19 Apr 2025
A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu
Junxuan Zhang
Minghao Guo
Youpeng Wen
H. Yang
...
Liqiong Wang
Yuxuan Kuang
Meng Cao
Feng Zheng
Xiaodan Liang
504
28
0
17 Apr 2025