GUI Agents with Foundation Models: A Comprehensive Survey

7 November 2024

Youssef Attia El Hili

Yasheng Wang

Ruiming Tang

Bin Wang

Chuhan Wu

Jianye Hao

LLMAG

Papers citing "GUI Agents with Foundation Models: A Comprehensive Survey"

27 / 77 papers shown

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Ming Yan

Ji Zhang

Fei Huang

Jitao Sang

LM&Ro LLMAG

03 Jun 2024

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

International Conference on Learning Representations (ICLR), 2024

...

Daniel Toyama

23 May 2024

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

08 Apr 2024

Android in the Zoo: Chain-of-Action-Thought for GUI Agents

05 Mar 2024

UFO: A UI-Focused Agent for Windows OS Interaction

...

08 Feb 2024

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Jiabo Ye

Ji Zhang

Fei Huang

Jitao Sang

29 Jan 2024

GPT-4V(ision) is a Generalist Web Agent, if Grounded

International Conference on Machine Learning (ICML), 2024

Huan Sun

03 Jan 2024

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

...

Julian McAuley

Zicheng Liu

13 Nov 2023

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang

17 Oct 2023

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

International Conference on Intelligent User Interfaces (IUI), 2023

07 Oct 2023

You Only Look at Screens: Multimodal Chain-of-Action Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Zhuosheng Zhang

Aston Zhang

LLMAG LM&Ro

20 Sep 2023

AutoDroid: LLM-powered Task Automation in Android

ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 2023

29 Aug 2023

A Survey on Large Language Model based Autonomous Agents

Lei Wang

...

Yankai Lin

22 Aug 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

International Conference on Learning Representations (ICLR), 2023

Xuhui Zhou

...

Daniel Fried

Graham Neubig

25 Jul 2023

Android in the Wild: A Large-Scale Dataset for Android Device Control

Neural Information Processing Systems (NeurIPS), 2023

19 Jul 2023

Multimodal Web Navigation with Instruction-Finetuned Foundation Models

International Conference on Learning Representations (ICLR), 2023

Hiroki Furuta

19 May 2023

Language Models can Solve Computer Tasks

Neural Information Processing Systems (NeurIPS), 2023

30 Mar 2023

ReAct: Synergizing Reasoning and Acting in Language Models

International Conference on Learning Representations (ICLR), 2022

Dian Yu

06 Oct 2022

Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

International Conference on Learning Representations (ICLR), 2022

Gang Li

Yang Li

29 Sep 2022

A data-driven approach for learning to control computers

International Conference on Machine Learning (ICML), 2022

...

16 Feb 2022

Environment Generation for Zero-Shot Compositional Reinforcement Learning

Neural Information Processing Systems (NeurIPS), 2022

21 Jan 2022

VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling

10 Dec 2021

Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

ACM Symposium on User Interface Software and Technology (UIST), 2021

07 Aug 2021

UIBert: Learning Generic Multimodal Representations for UI Understanding

International Joint Conference on Artificial Intelligence (IJCAI), 2021

Blaise Agüera y Arcas

29 Jul 2021

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

International Conference on Human Factors in Computing Systems (CHI), 2021

...

13 Jan 2021

Learning to Navigate the Web

21 Dec 2018

Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

24 Feb 2018