2310.11441
Cited By
v1
v2 (latest)
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
17 October 2023
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chun-yue Li
Jianfeng Gao
MLLM
VLM
ArXiv (abs)
PDF
HTML
HuggingFace (28 upvotes)
Github (1387★)
Papers citing
"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V"
50 / 168 papers shown
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xinyi Liu
Xiaoyi Zhang
Ziyun Zhang
Yan Lu
569
14
0
15 Apr 2025
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Suyu Ye
Haojun Shi
Darren Shih
Hyokun Yun
Tanya Roosta
Tianmin Shu
362
11
0
14 Apr 2025
GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation
Haotian Xu
Yue Hu
Chen Gao
Zhengqiu Zhu
Yong Zhao
Yongqian Li
Quanjun Yin
532
5
0
13 Apr 2025
Domain-Conditioned Scene Graphs for State-Grounded Task Planning
Jonas Herzog
Jiangpin Liu
Yue Wang
LM&Ro
319
2
0
09 Apr 2025
Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning
Ram Ramrakhya
Matthew Chang
Xavier Puig
Ruta Desai
Z. Kira
Roozbeh Mottaghi
LLMAG
LM&Ro
436
11
0
01 Apr 2025
A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models
Liangbo Ning
Ziran Liang
Zhuohang Jiang
Haohao Qu
Yujuan Ding
...
Xiao Wei
Shanru Lin
Hui Liu
Philip S. Yu
Qing Li
LLMAG
LM&Ro
622
55
0
30 Mar 2025
Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study
Li Zhang
Longxi Gao
Mengwei Xu
LRM
227
7
0
21 Mar 2025
M3: 3D-Spatial MultiModal Memory
International Conference on Learning Representations (ICLR), 2025
Xueyan Zou
Yuchen Song
Ri-Zhao Qiu
Xuanbin Peng
Jianglong Ye
Sifei Liu
Xiaolong Wang
3DGS
261
2
0
20 Mar 2025
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
...
David Vazquez
Christopher Pal
Perouz Taslakian
Spandana Gella
Sai Rajeswar
1.2K
30
0
19 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
348
9
0
18 Mar 2025
DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents
Yibin Xu
Liang Yang
Hao Chen
Hua Wang
Zhi Chen
Yaohua Tang
3DV
360
2
0
14 Mar 2025
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
Yiyang Ling
Karan Owalekar
Oluwatobiloba Adesanya
Erdem Bıyık
Daniel Seita
346
5
0
13 Mar 2025
Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding
Shunqi Mao
Chaoyi Zhang
Weidong Cai
MLLM
1.1K
4
0
13 Mar 2025
In-Context Defense in Computer Agents: An Empirical Study
Pei Yang
Hai Ci
Mike Zheng Shou
AAML
LLMAG
334
6
0
12 Mar 2025
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Arman Zharmagambetov
Chuan Guo
Ivan Evtimov
Maya Pavlova
Ruslan Salakhutdinov
Kamalika Chaudhuri
LLMAG
456
27
0
12 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
Xiaoyu Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
491
12
0
10 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Nianzu Yang
Yun Zheng
Liwei Wang
ObjD
VLM
455
13
0
03 Mar 2025
Introducing Visual Perception Token into Multimodal Large Language Model
Runpeng Yu
Xinyin Ma
Xinchao Wang
MLLM
LRM
334
12
0
24 Feb 2025
Programming with Pixels: Can Computer-Use Agents do Software Engineering?
Pranjal Aggarwal
Sean Welleck
363
1
0
24 Feb 2025
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shengguang Wu
Fan-Yun Sun
Kaiyue Wen
Nick Haber
VLM
409
8
0
19 Feb 2025
Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks
Yurun Chen
Xavier Hu
Keting Yin
Juncheng Billy Li
Shengyu Zhang
AAML
281
13
0
18 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Computer Vision and Pattern Recognition (CVPR), 2025
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
368
95
0
18 Feb 2025
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi
Wenyao Zhang
Yufei Ding
Runpei Dong
Xinqiang Yu
...
Xin Jin
Kaisheng Ma
Zhizheng Zhang
He Wang
Li Yi
LM&Ro
459
15
0
18 Feb 2025
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Vardaan Pahuja
Yadong Lu
Corby Rosset
Boyu Gou
Arindam Mitra
Spencer Whitehead
Eric Fosler-Lussier
Ahmed Awadallah
LLMAG
LM&Ro
867
26
1
17 Feb 2025
Digi-Q: Learning Q-Value Functions for Training Device-Control Agents
Hao Bai
Yifei Zhou
Li Erran Li
Sergey Levine
Aviral Kumar
OffRL
318
11
0
13 Feb 2025
Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling
Xiaowen Qiu
Jincheng Yang
Yian Wang
Zhehuan Chen
Yufei Wang
Tsun-Hsuan Wang
Zhou Xian
Chuang Gan
771
28
0
04 Feb 2025
Embodied Scene Understanding for Vision Language Models via MetaVQA
Computer Vision and Pattern Recognition (CVPR), 2025
Weizhen Wang
Chenda Duan
Zhenghao Peng
Yuxin Liu
Bolei Zhou
LM&Ro
327
9
0
17 Jan 2025
Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis
ACM Transactions on Knowledge Discovery from Data (TKDD), 2024
Lanling Xu
Junjie Zhang
Bingqian Li
Jinpeng Wang
Sheng Chen
Wayne Xin Zhao
Ji-Rong Wen
454
25
0
17 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Yuan Liu
Hengshuang Zhao
769
52
0
02 Jan 2025
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Ashay Athalye
Nishanth Kumar
Tom Silver
Yichao Liang
Tomás Lozano-Pérez
Leslie Pack Kaelbling
LM&Ro
344
6
0
31 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro
LLMAG
510
96
0
20 Dec 2024
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
Dimitrios Mallis
Ahmet Serdar Karadeniz
Sebastian Cavada
Danila Rukhovich
Niki Maria Foteinopoulou
K. Cherenkova
Anis Kacem
Djamila Aouada
608
16
0
18 Dec 2024
RelationField: Relate Anything in Radiance Fields
Computer Vision and Pattern Recognition (CVPR), 2024
Sebastian Koch
Johanna Wald
Mirco Colosi
Narunas Vaskevicius
Pedro Hermosilla
F. Tombari
Timo Ropinski
414
7
0
18 Dec 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu
Yufan Song
Boxuan Li
Yuxuan Tang
Kritanjali Jain
...
Wayne Chi
Lawrence Jang
Yiqing Xie
Shuyan Zhou
Graham Neubig
ELM
757
100
0
18 Dec 2024
Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Hai-Ming Xu
Qi Chen
Lei Wang
Lingqiao Liu
305
9
0
14 Dec 2024
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier De Chezelles
Maxime Gasse
Alexandre Lacoste
Alexandre Drouin
Massimo Caccia
...
Siva Reddy
Quentin Cappart
Graham Neubig
Ruslan Salakhutdinov
Nicolas Chapados
LLMAG
2.0K
64
0
06 Dec 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLM
LRM
562
22
0
27 Nov 2024
GUI Agents with Foundation Models: A Comprehensive Survey
Shuai Wang
Wen Liu
Jingxuan Chen
Weinan Gan
Xingshan Zeng
...
Bin Wang
Chuhan Wu
Yasheng Wang
Ruiming Tang
Jianye Hao
LLMAG
487
75
0
07 Nov 2024
Attacking Vision-Language Computer Agents via Pop-ups
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yanzhe Zhang
Tao Yu
Diyi Yang
AAML
VLM
435
77
0
04 Nov 2024
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
International Conference on Learning Representations (ICLR), 2024
Jingxuan Chen
Derek Yuen
Bin Xie
Yue Yang
Gongwei Chen
...
Liqiang Nie
Yasheng Wang
Jianye Hao
Jun Wang
Youssef Attia El Hili
LLMAG
522
47
0
19 Oct 2024
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
International Conference on Learning Representations (ICLR), 2024
Yue Yang
Shanghang Zhang
Wenqi Shao
Kaipeng Zhang
Yi Bin
Yu Wang
Ping Luo
442
15
0
11 Oct 2024
GSON: A Group-based Social Navigation Framework with Large Multimodal Model
IEEE Robotics and Automation Letters (RA-L), 2024
Shangyi Luo
Peng Sun
Ji Zhu
Yuhong Deng
Cunjun Yu
Anxing Xiao
Xueqian Wang
LM&Ro
494
6
0
26 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Junzhuo Liu
Xiaohu Yang
Weiwei Li
Peng Wang
ObjD
390
13
0
23 Sep 2024
Vision Language Models Can Parse Floor Plan Maps
David DeFazio
Hrudayangam Mehta
Meng Wang
Ping Yang
Jeremy Blackburn
Shiqi Zhang
CoGe
362
5
0
19 Sep 2024
Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild
Knowledge-Based Systems (KBS), 2024
Yaozong Gan
Guang Li
Ren Togo
Keisuke Maeda
Takahiro Ogawa
Miki Haseyama
332
1
0
03 Sep 2024
EditScribe: Non-Visual Image Editing with Natural Language Verification Loops
International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 2024
Ruei-Che Chang
Yuxuan Liu
Lotus Zhang
Anhong Guo
DiffM
200
11
0
13 Aug 2024
VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments
IEEE Robotics and Automation Letters (RA-L), 2024
Daeun Song
Jing Liang
Xuesu Xiao
Dinesh Manocha
624
20
0
05 Aug 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu
Xinyue Cai
Jiayi Ji
Jiale Li
Oucheng Huang
Gen Luo
Hao Fei
Xiaoshuai Sun
Rongrong Ji
MLLM
333
29
0
31 Jul 2024
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Yuxiang Chai
Siyuan Huang
Yazhe Niu
Han Xiao
Liang Liu
Dingyu Zhang
Shuai Ren
Hongsheng Li
LLMAG
391
83
0
03 Jul 2024
Tree Search for Language Model Agents
Jing Yu Koh
Alexander Shmakov
Daniel Fried
Ruslan Salakhutdinov
LRM
LM&Ro
LLMAG
421
120
0
01 Jul 2024