GUI Agents with Foundation Models: A Comprehensive Survey

7 November 2024
Shuai Wang
Wen Liu
Jingxuan Chen
Weinan Gan
Xingshan Zeng
Shuai Yu
Xinlong Hao
Youssef Attia El Hili
Yasheng Wang
Ruiming Tang
Bin Wang
Chuhan Wu
Jianye Hao
    LLMAG

Papers citing "GUI Agents with Foundation Models: A Comprehensive Survey"

27 / 77 papers shown
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Junyang Wang
Haiyang Xu
Haitao Jia
Xi Zhang
Ming Yan
Weizhou Shen
Ji Zhang
Fei Huang
Jitao Sang
LM&Ro, LLMAG
378
144
0
03 Jun 2024
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
International Conference on Learning Representations (ICLR), 2024
Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
...
Daniel Toyama
Robert Berry
Divya Tyamagundlu
Timothy Lillicrap
Oriana Riva
LLMAG
658
186
0
23 May 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You
Haotian Zhang
E. Schoop
Floris Weers
Amanda Swearngin
Jeffrey Nichols
Yinfei Yang
Zhe Gan
MLLM
357
150
0
08 Apr 2024
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang
Jihao Wu
Yihua Teng
Minghui Liao
Nuo Xu
Xiao Xiao
Zhongyu Wei
Duyu Tang
LLMAG, LM&Ro
480
129
0
05 Mar 2024
UFO: A UI-Focused Agent for Windows OS Interaction
Chaoyun Zhang
Liqun Li
Shilin He
Xu Zhang
Bo Qiao
...
Yu Kang
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
Qi Zhang
LLMAG
554
129
0
08 Feb 2024
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Junyang Wang
Haiyang Xu
Jiabo Ye
Mingshi Yan
Weizhou Shen
Ji Zhang
Fei Huang
Jitao Sang
331
226
0
29 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
International Conference on Machine Learning (ICML), 2024
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM, VLM, LLMAG
407
407
0
03 Jan 2024
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
Kevin Qinghong Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG, LM&Ro
402
145
0
13 Nov 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chun-yue Li
Jianfeng Gao
MLLM, VLM
447
269
0
17 Oct 2023
ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
International Conference on Intelligent User Interfaces (IUI), 2023
Yue Jiang
E. Schoop
Amanda Swearngin
Jeffrey Nichols
MLLM
359
25
0
07 Oct 2023
You Only Look at Screens: Multimodal Chain-of-Action Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Zhuosheng Zhang
Aston Zhang
LLMAG, LM&Ro
489
170
0
20 Sep 2023
AutoDroid: LLM-powered Task Automation in Android
ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 2023
Hao Wen
Yuanchun Li
Guohong Liu
Shanhui Zhao
Tao Yu
Toby Jia-Jun Li
Shiqi Jiang
Yunhao Liu
Yaqin Zhang
Yunxin Liu
431
183
0
29 Aug 2023
A Survey on Large Language Model based Autonomous Agents
Lei Wang
Chengbang Ma
Xueyang Feng
Zeyu Zhang
Hao-ran Yang
...
Xu Chen
Yankai Lin
Wayne Xin Zhao
Zhewei Wei
Ji-Rong Wen
LLMAG, AI4CE, LM&Ro
670
2,158
0
22 Aug 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
International Conference on Learning Representations (ICLR), 2023
Shuyan Zhou
Frank F. Xu
Hao Zhu
Xuhui Zhou
Robert Lo
...
Tianyue Ou
Yonatan Bisk
Daniel Fried
Uri Alon
Graham Neubig
LLMAG
685
861
0
25 Jul 2023
Android in the Wild: A Large-Scale Dataset for Android Device Control
Neural Information Processing Systems (NeurIPS), 2023
Christopher Rawles
Alice Li
Daniel Rodriguez
Oriana Riva
Timothy Lillicrap
LM&Ro
409
256
0
19 Jul 2023
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
International Conference on Learning Representations (ICLR), 2023
Hiroki Furuta
Kuang-Huei Lee
Ofir Nachum
Yutaka Matsuo
Aleksandra Faust
S. Gu
Izzeddin Gur
LM&Ro
427
143
0
19 May 2023
Language Models can Solve Computer Tasks
Neural Information Processing Systems (NeurIPS), 2023
Geunwoo Kim
Pierre Baldi
Alexander Shmakov
LLMAG, LM&Ro
572
467
0
30 Mar 2023
ReAct: Synergizing Reasoning and Acting in Language Models
International Conference on Learning Representations (ICLR), 2022
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG, ReLM, LRM
2.6K
5,491
0
06 Oct 2022
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
International Conference on Learning Representations (ICLR), 2022
Gang Li
Yang Li
361
82
0
29 Sep 2022
A data-driven approach for learning to control computers
International Conference on Machine Learning (ICML), 2022
Peter C. Humphreys
David Raposo
Tobias Pohlen
Gregory Thornton
Rachita Chhaparia
...
Josh Abramson
Petko Georgiev
Alex Goldin
Adam Santoro
Timothy Lillicrap
334
115
0
16 Feb 2022
Environment Generation for Zero-Shot Compositional Reinforcement Learning
Neural Information Processing Systems (NeurIPS), 2022
Izzeddin Gur
Natasha Jaques
Yingjie Miao
Jongwook Choi
Manoj Kumar Tiwari
Honglak Lee
Aleksandra Faust
264
45
0
21 Jan 2022
VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling
Yang Li
Gang Li
Xin Zhou
Mostafa Dehghani
A. Gritsenko
MLLM
182
40
0
10 Dec 2021
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
ACM Symposium on User Interface Software and Technology (UIST), 2021
Bryan Wang
Gang Li
Xin Zhou
Zhourong Chen
Tovi Grossman
Yang Li
850
198
0
07 Aug 2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chongyang Bai
Xiaoxue Zang
Ying Xu
Srinivas Sunkara
Abhinav Rastogi
Jindong Chen
Blaise Agüera y Arcas
274
113
0
29 Jul 2021
Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels
International Conference on Human Factors in Computing Systems (CHI), 2021
Xiaoyi Zhang
Lilian de Greef
Amanda Swearngin
Samuel White
Kyle I. Murray
...
Jeffrey Nichols
Jason Wu
Chris Fleizach
Aaron Everitt
Jeffrey P. Bigham
746
192
0
13 Jan 2021
Learning to Navigate the Web
Izzeddin Gur
U. Rückert
Aleksandra Faust
Dilek Z. Hakkani-Tür
226
71
0
21 Dec 2018
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
Emmy Liu
Kelvin Guu
Panupong Pasupat
Tianlin Shi
Abigail Z. Jacobs
OnRL
211
281
0
24 Feb 2018