arXiv:2010.04295
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
8 October 2020
Yongqian Li
Gang Li
Luheng He
Jingjie Zheng
Hong Li
Zhiwei Guan
Papers citing
"Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements"
50 / 78 papers shown
Grounding Computer Use Agents on Human Demonstrations
Aarash Feizi
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Kaixin Li
...
Reihaneh Rabbany
Perouz Taslakian
C. Pal
Spandana Gella
Sai Rajeswar
LLMAG
172
1
0
10 Nov 2025
UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Liangyu Chen
Zhengyu Ma
C. Cai
J. Zhang
Panrong Tong
...
Yuqi Liu
Wenxuan Wang
Yue Wang
Qin Jin
Steven C. H. Hoi
LRM
136
4
0
23 Oct 2025
UIPro: Unleashing Superior Interaction Capability For GUI Agents
Hongxin Li
Jingran Su
Jingfan Chen
Zheng Ju
Yuntao Chen
Qing Li
Zhaoxiang Zhang
LLMAG
235
0
0
22 Sep 2025
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Zongru Wu
Rui Mao
Zhiyuan Tian
Pengzhou Cheng
Tianjie Ju
Zheng Wu
Lingzhong Dong
Haiyue Sheng
Zhuosheng Zhang
Gongshen Liu
128
0
0
17 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
320
3
0
12 Sep 2025
UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng
Jing Huang
Liming Zheng
Wenkang Han
Yufeng Zhong
Lei Chen
Longrong Yang
Yingjie Chu
Yuzhi He
Lin Ma
LLMAG
193
6
0
29 Aug 2025
AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code
Nadeen Fathallah
Daniel Hernández
Steffen Staab
3DV
VLM
144
2
0
24 Jul 2025
MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning
Liujian Tang
Shaokang Dong
Y. Huang
Minqi Xiang
Hongtao Ruan
...
Qi Zhang
Kang Wang
Y. Zhang
Y. Wang
Yuran Wang
LM&Ro
426
6
0
19 Jul 2025
VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation
Ziang Ye
Yang Zhang
Wentao Shi
Xiaoyu You
Fuli Feng
Tat-Seng Chua
AAML
313
3
0
09 Jul 2025
GTA1: GUI Test-time Scaling Agent
Yan Yang
Dongxu Li
Yutong Dai
Yuhao Yang
Ziyang Luo
...
Ran Xu
Liyuan Pan
Silvio Savarese
Caiming Xiong
Junnan Li
LLMAG
402
40
0
08 Jul 2025
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
Penghao Wu
Shengnan Ma
Bo Wang
Jiaheng Yu
Lewei Lu
Ziwei Liu
244
10
0
09 Jun 2025
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Tianbao Xie
Jiaqi Deng
Xiaochuan Li
Junlin Yang
Haoyuan Wu
...
Yiheng Xu
Junli Wang
Doyen Sahoo
Tao Yu
Caiming Xiong
401
52
0
19 May 2025
GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning
Longxi Gao
Li Zhang
Mengwei Xu
Wei Liu
Jian Luan
437
4
0
18 May 2025
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Ada Chen
Yongjiang Wu
Jing Zhang
Shu Yang
Jen-tse Huang
Wenxuan Wang
S. Wang
ELM
441
11
0
16 May 2025
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Yuhang Liu
Pengxiang Li
C. Xie
Xavier Hu
Xiaotian Han
Shengyu Zhang
Hongxia Yang
Fei Wu
LLMAG
LM&Ro
LRM
AI4CE
382
73
0
19 Apr 2025
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xinyi Liu
Xiaoyi Zhang
Ziyun Zhang
Yan Lu
565
13
0
15 Apr 2025
Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up Questions
Ziming Cheng
Zhiyuan Huang
Junting Pan
Zhaohui Hou
Mingjie Zhan
393
5
0
31 Mar 2025
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
...
David Vazquez
Christopher Pal
Perouz Taslakian
Spandana Gella
Sai Rajeswar
1.2K
30
0
19 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
338
9
0
18 Mar 2025
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Lijie Fan
Luming Tang
Siyang Qin
Tianhong Li
Xuan S. Yang
...
Tao Zhu
Michael Rubinstein
Michalis Raptis
Deqing Sun
Radu Soricut
316
28
0
17 Mar 2025
DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents
Yibin Xu
Liang Yang
Hao Chen
Hua Wang
Zhi Chen
Yaohua Tang
3DV
356
2
0
14 Mar 2025
SpiritSight Agent: Advanced GUI Agent with One Look
Computer Vision and Pattern Recognition (CVPR), 2025
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
423
11
0
05 Mar 2025
MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions
Knowledge Discovery and Data Mining (KDD), 2025
Yuxuan Liu
Hongda Sun
Wei Liu
Jian Luan
Bo Du
Rui Yan
440
9
0
24 Feb 2025
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hongxin Li
Jingfan Chen
Jingran Su
Yuntao Chen
Qing Li
Rundong Wang
995
8
0
04 Feb 2025
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Yunxing Liu
Pengxiang Li
Zishu Wei
C. Xie
Xueyu Hu
Xinchen Xu
Shengyu Zhang
Xiaotian Han
Hongxia Yang
Leilei Gan
LLMAG
LRM
318
44
0
08 Jan 2025
Towards Human-AI Synergy in UI Design: Enhancing Multi-Agent Based UI Generation with Intent Clarification and Alignment
M. Yuan
Jieshan Chen
Yongquan Hu
Sidong Feng
Mulong Xie
Gelareh Mohammadi
Zhenchang Xing
Aaron Quigley
LLMAG
244
1
0
28 Dec 2024
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro
LLMAG
502
94
0
20 Dec 2024
Falcon-UI: Understanding GUI Before Following User Instructions
Huawen Shen
Yu Xie
Gengluo Li
Xinlong Wang
Can Ma
Xiangyang Ji
LLMAG
379
16
0
12 Dec 2024
Foundations and Recent Trends in Multimodal Mobile Agents: A Survey
Biao Wu
Yanda Li
Meng Fang
Zirui Song
Zhiwei Zhang
Yunchao Wei
LM&Ro
LLMAG
OffRL
AI4TS
426
19
0
04 Nov 2024
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Xuetian Chen
Hangcheng Li
Jiaqing Liang
Sihang Jiang
Deqing Yang
LLMAG
466
7
0
25 Oct 2024
Harnessing Webpage UIs for Text-Rich Visual Understanding
International Conference on Learning Representations (ICLR), 2024
Junpeng Liu
Tianyue Ou
Yifan Song
Yuxiao Qu
Wai Lam
Chenyan Xiong
Lei Ma
Graham Neubig
Xiang Yue
377
21
0
17 Oct 2024
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
International Conference on Learning Representations (ICLR), 2024
Boyu Gou
Ruohan Wang
Boyuan Zheng
Yanan Xie
Cheng Chang
Yiheng Shu
Huan Sun
Eric Fosler-Lussier
LM&Ro
LLMAG
636
236
0
07 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
303
66
1
30 Sep 2024
Inferring Alt-text For UI Icons With Large Language Models During App Development
Sabrina Haque
Christoph Csallner
VLM
263
2
0
26 Sep 2024
MobileViews: A Large-Scale Mobile GUI Dataset
Longxi Gao
Li Zhang
Shihe Wang
Shangguang Wang
Yuanchun Li
Mengwei Xu
208
13
0
22 Sep 2024
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang
Srinivas Sunkara
Gilles Baechler
Jason Lin
Yun Zhu
Fedir Zubach
Lei Shu
Jindong Chen
LRM
LLMAG
306
12
0
06 Sep 2024
OmniParser for Pure Vision Based GUI Agent
Yadong Lu
Jianwei Yang
Yelong Shen
Ahmed Hassan Awadallah
MLLM
334
121
0
01 Aug 2024
Flowy: Supporting UX Design Decisions Through AI-Driven Pattern Annotation in Multi-Screen User Flows
Yuwen Lu
Ziang Tong
Qinyi Zhao
Yewon Oh
Bryan Wang
Toby Jia-Jun Li
309
9
0
23 Jun 2024
Tell Me What's Next: Textual Foresight for Generic UI Representations
Andrea Burns
Kate Saenko
Bryan A. Plummer
LM&Ro
AI4TS
282
7
0
12 Jun 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu
Zeyang Zhou
Kexin Huang
Dandan Liang
Yixu Wang
...
Keqing Wang
Yujiu Yang
Yan Teng
Botian Shi
Yingchun Wang
ELM
277
31
0
11 Jun 2024
MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling
Sidong Feng
Suyu Ma
Han Wang
David Kong
Chunyang Chen
266
16
0
11 May 2024
GUing: A Mobile GUI Search Engine using a Vision-Language Model
Jialiang Wei
A. Courbis
Thomas Lambolais
Binbin Xu
P. Bernard
Gérard Dray
Walid Maalej
DiffM
CLIP
197
11
0
30 Apr 2024
Benchmarking Mobile Device Control Agents across Diverse Configurations
Juyong Lee
Taywon Min
Minyong An
Dongyoon Hahm
Changyeon Kim
Kimin Lee
360
30
0
25 Apr 2024
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Junpeng Liu
Yifan Song
Bill Yuchen Lin
Wai Lam
Graham Neubig
Yuanzhi Li
Xiang Yue
VLM
289
79
0
09 Apr 2024
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu
Wen Liu
Bo Zhang
Bing-Li Wang
Kai Dong
...
Yaofeng Sun
Chengqi Deng
Hanwei Xu
Zhenda Xie
Chong Ruan
VLM
457
647
0
08 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
412
15
0
05 Mar 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor
Y. Butala
M. Russak
Jing Yu Koh
Kiran Kamble
Waseem Alshikh
Ruslan Salakhutdinov
LLMAG
493
105
0
27 Feb 2024
ScreenAgent: A Vision Language Model-driven Computer Control Agent
Runliang Niu
Jindong Li
Shiqi Wang
Yali Fu
Xiyu Hu
Xueyuan Leng
He Kong
Yi Chang
Zhiqiang Zhang
LLMAG
MLLM
LM&Ro
317
80
0
09 Feb 2024
AI Assistance for UX: A Literature Review Through Human-Centered AI
Yuwen Lu
Yuewen Yang
Qinyi Zhao
Chengzhi Zhang
Toby Jia-Jun Li
284
32
0
08 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
853
96
0
07 Feb 2024