ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.00245
  4. Cited By
From Pixels to UI Actions: Learning to Follow Instructions via Graphical
  User Interfaces

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

31 May 2023
Peter Shaw
Mandar Joshi
James Cohan
Jonathan Berant
Panupong Pasupat
Hexiang Hu
Urvashi Khandelwal
Kenton Lee
Kristina Toutanova
    LLMAG
    LM&Ro
ArXivPDFHTML

Papers citing "From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces"

48 / 48 papers shown
Title
Visual Test-time Scaling for GUI Agent Grounding
Visual Test-time Scaling for GUI Agent Grounding
Tiange Luo
Lajanugen Logeswaran
Justin Johnson
Honglak Lee
51
0
0
01 May 2025
Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning
Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning
Lynn Cherif
Flemming Kondrup
David Venuto
Ankit Anand
Doina Precup
Khimya Khetarpal
LM&Ro
51
0
0
24 Apr 2025
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang
Zirui Shang
Zhi Gao
Wang Zhang
Rui Xie
Xiaojian Ma
Tao Yuan
Xinxiao Wu
Song-Chun Zhu
Qing Li
LLMAG
37
1
0
17 Apr 2025
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lù
Amirhossein Kazemnejad
Nicholas Meade
Arkil Patel
Dongchan Shin
Alejandra Zambrano
Karolina Stañczak
Peter Shaw
Christopher Pal
Siva Reddy
LLMAG
40
1
0
11 Apr 2025
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
Boyuan Zheng
Michael Y. Fatemi
Xiaolong Jin
Zhilin Wang
Apurva Gandhi
...
Yu Gu
Jayanth Srinivasa
Gaowen Liu
Graham Neubig
Yu Su
CLL
38
1
0
09 Apr 2025
On the Robustness of GUI Grounding Models Against Image Attacks
On the Robustness of GUI Grounding Models Against Image Attacks
Haoren Zhao
Tianyi Chen
Zhen Wang
AAML
36
0
0
07 Apr 2025
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
Yuchen Sun
Shanhui Zhao
Tao Yu
Hao Wen
Samith Va
Mengwei Xu
Yuanchun Li
Chongyang Zhang
LLMAG
62
0
0
22 Mar 2025
SafeArena: Evaluating the Safety of Autonomous Web Agents
Ada Defne Tur
Nicholas Meade
Xing Han Lù
Alejandra Zambrano
Arkil Patel
Esin Durmus
Spandana Gella
Karolina Stañczak
Siva Reddy
LLMAG
ELM
87
2
0
06 Mar 2025
SpiritSight Agent: Advanced GUI Agent with One Look
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
101
2
0
05 Mar 2025
Magma: A Foundation Model for Multimodal AI Agents
Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
122
9
0
18 Feb 2025
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Hiroki Furuta
Yutaka Matsuo
Aleksandra Faust
Izzeddin Gur
CLL
89
14
0
03 Jan 2025
The BrowserGym Ecosystem for Web Agent Research
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier De Chezelles
Maxime Gasse
Alexandre Lacoste
Alexandre Drouin
Massimo Caccia
...
Siva Reddy
Quentin Cappart
Graham Neubig
Ruslan Salakhutdinov
Nicolas Chapados
LLMAG
106
9
0
06 Dec 2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin
Linjie Li
Difei Gao
Z. Yang
Shiwei Wu
Zechen Bai
Weixian Lei
Lijuan Wang
Mike Zheng Shou
LLMAG
74
13
0
26 Nov 2024
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
Minghe Gao
Wendong Bu
Bingchen Miao
Yang Wu
Yunfei Li
Juncheng Billy Li
Siliang Tang
Qi Wu
Yueting Zhuang
Meng Wang
LM&Ro
45
3
0
17 Nov 2024
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Saaket Agashe
Jiuzhou Han
Shuyu Gan
Jiachen Yang
Ang Li
Xin Eric Wang
LLMAG
LM&Ro
AIFin
39
20
0
10 Oct 2024
On the Modeling Capabilities of Large Language Models for Sequential
  Decision Making
On the Modeling Capabilities of Large Language Models for Sequential Decision Making
Martin Klissarov
Devon Hjelm
Alexander Toshev
Bogdan Mazoure
LM&Ro
ELM
OffRL
LRM
34
2
0
08 Oct 2024
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou
Ruohan Wang
Boyuan Zheng
Yanan Xie
Cheng Chang
Yiheng Shu
Huan Sun
Yu Su
LM&Ro
LLMAG
76
49
0
07 Oct 2024
From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with
  LLM-Guided Knowledge
From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge
Xiefeng Wu
OffRL
34
1
0
02 Oct 2024
Synatra: Turning Indirect Knowledge into Direct Demonstrations for
  Digital Agents at Scale
Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale
Tianyue Ou
Frank F. Xu
Aman Madaan
J. Liu
Robert Lo
Abishek Sridhar
Sudipta Sengupta
Dan Roth
Graham Neubig
Shuyan Zhou
OffRL
36
9
0
24 Sep 2024
OmniParser for Pure Vision Based GUI Agent
OmniParser for Pure Vision Based GUI Agent
Yadong Lu
Jianwei Yang
Yelong Shen
Ahmed Hassan Awadallah
MLLM
32
33
0
01 Aug 2024
GUICourse: From General Vision Language Models to Versatile GUI Agents
GUICourse: From General Vision Language Models to Versatile GUI Agents
Wentong Chen
Junbo Cui
Jinyi Hu
Yujia Qin
Junjie Fang
...
Yupeng Huo
Yuan Yao
Yankai Lin
Zhiyuan Liu
Maosong Sun
LLMAG
33
30
0
17 Jun 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin Qinghong Lin
Linjie Li
Difei Gao
Qinchen Wu
Mingyi Yan
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
46
10
0
14 Jun 2024
CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks
  with Front-End UI Only
CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only
Junhee Cho
Jihoon Kim
Daseul Bae
Jinho Choo
Youngjune Gwon
Yeong-Dae Kwon
LLMAG
36
1
0
11 Jun 2024
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
...
Daniel Toyama
Robert Berry
Divya Tyamagundlu
Timothy Lillicrap
Oriana Riva
LLMAG
69
44
0
23 May 2024
Latent State Estimation Helps UI Agents to Reason
Latent State Estimation Helps UI Agents to Reason
Will Bishop
Alice Li
Christopher Rawles
Oriana Riva
LRM
LLMAG
19
3
0
17 May 2024
Enhancing Q-Learning with Large Language Model Heuristics
Enhancing Q-Learning with Large Language Model Heuristics
Xiefeng Wu
LRM
32
0
0
06 May 2024
Benchmarking Mobile Device Control Agents across Diverse Configurations
Benchmarking Mobile Device Control Agents across Diverse Configurations
Juyong Lee
Taywon Min
Minyong An
Changyeon Kim
Kimin Lee
36
9
0
25 Apr 2024
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Keen You
Haotian Zhang
E. Schoop
Floris Weers
Amanda Swearngin
Jeffrey Nichols
Yinfei Yang
Zhe Gan
MLLM
47
82
0
08 Apr 2024
Computer User Interface Understanding. A New Dataset and a Learning
  Framework
Computer User Interface Understanding. A New Dataset and a Learning Framework
Andrés Munoz
Daniel Borrajo
30
0
0
15 Mar 2024
BAGEL: Bootstrapping Agents by Guiding Exploration with Language
BAGEL: Bootstrapping Agents by Guiding Exploration with Language
Shikhar Murty
Christopher D. Manning
Peter Shaw
Mandar Joshi
Kenton Lee
LM&Ro
LLMAG
26
14
0
12 Mar 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist
  Autonomous Agents for Desktop and Web
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor
Y. Butala
M. Russak
Jing Yu Koh
Kiran Kamble
Waseem Alshikh
Ruslan Salakhutdinov
LLMAG
51
44
0
27 Feb 2024
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Xing Han Lù
Zdeněk Kasner
Siva Reddy
32
59
0
08 Feb 2024
Dual-View Visual Contextualization for Web Navigation
Dual-View Visual Contextualization for Web Navigation
Jihyung Kil
Chan Hee Song
Boyuan Zheng
Xiang Deng
Yu-Chuan Su
Wei-Lun Chao
EgoV
22
12
0
06 Feb 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM
VLM
LLMAG
46
207
0
03 Jan 2024
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
Difei Gao
Lei Ji
Zechen Bai
Mingyu Ouyang
Peiran Li
...
Peiyi Wang
Xiangwu Guo
Hengxu Wang
Luowei Zhou
Mike Zheng Shou
LLMAG
23
21
0
20 Dec 2023
A Zero-Shot Language Agent for Computer Control with Structured
  Reflection
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Tao Li
Gang Li
Zhiwei Deng
Bryan Wang
Yang Li
LM&Ro
LLMAG
59
23
0
12 Oct 2023
AXNav: Replaying Accessibility Tests from Natural Language
AXNav: Replaying Accessibility Tests from Natural Language
Maryam Taeb
Amanda Swearngin
E. Schoop
Ruijia Cheng
Yue Jiang
Jeffrey Nichols
28
37
0
03 Oct 2023
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Martin Klissarov
P. DÓro
Shagun Sodhani
Roberta Raileanu
Pierre-Luc Bacon
Pascal Vincent
Amy Zhang
Mikael Henaff
LRM
LLMAG
29
54
0
29 Sep 2023
ExpeL: LLM Agents Are Experiential Learners
ExpeL: LLM Agents Are Experiential Learners
Andrew Zhao
Daniel Huang
Quentin Xu
Matthieu Lin
Yong-Jin Liu
Gao Huang
LLMAG
22
193
0
20 Aug 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou
Frank F. Xu
Hao Zhu
Xuhui Zhou
Robert Lo
...
Tianyue Ou
Yonatan Bisk
Daniel Fried
Uri Alon
Graham Neubig
LLMAG
36
381
0
25 Jul 2023
A Real-World WebAgent with Planning, Long Context Understanding, and
  Program Synthesis
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Izzeddin Gur
Hiroki Furuta
Austin Huang
Mustafa Safdari
Yutaka Matsuo
Douglas Eck
Aleksandra Faust
LM&Ro
LLMAG
39
198
0
24 Jul 2023
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer
  Control
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng
R. Wang
Xinrun Wang
Bo An
LLMAG
22
57
0
13 Jun 2023
Vision-Language Models as Success Detectors
Vision-Language Models as Success Detectors
Yuqing Du
Ksenia Konyushkova
Misha Denil
A. Raju
Jessica Landon
Felix Hill
Nando de Freitas
Serkan Cabi
MLLM
LRM
89
77
0
13 Mar 2023
Understanding HTML with Large Language Models
Understanding HTML with Large Language Models
Izzeddin Gur
Ofir Nachum
Yingjie Miao
Mustafa Safdari
Austin Huang
Aakanksha Chowdhery
Sharan Narang
Noah Fiedel
Aleksandra Faust
AI4CE
138
70
0
08 Oct 2022
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language
  Understanding
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
163
263
0
07 Oct 2022
ReAct: Synergizing Reasoning and Acting in Language Models
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
240
2,494
0
06 Oct 2022
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Bryan Wang
Gang Li
Xin Zhou
Zhourong Chen
Tovi Grossman
Yang Li
167
152
0
07 Aug 2021
Google's Neural Machine Translation System: Bridging the Gap between
  Human and Machine Translation
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu
M. Schuster
Z. Chen
Quoc V. Le
Mohammad Norouzi
...
Alex Rudnick
Oriol Vinyals
G. Corrado
Macduff Hughes
J. Dean
AIMat
716
6,743
0
26 Sep 2016
1