2504.11543
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
15 April 2025
Divyansh Garg
Shaun VanWeelden
Diego Caples
Andis Draguns
Nikil Ravi
Pranav Putta
Naman Garg
Tomas Abraham
Michael Lara
Federico Lopez
James Liu
Atharva Gundawar
Prannay Hebbar
Youngchul Joo
Jindong Gu
Charles London
Christian Schroeder de Witt
S. Motwani
Papers citing
"REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites"
20 / 20 papers shown
Title
OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Karen Ullrich
Jingtong Su
Claudia Shi
Arjun Subramonian
Amir Bar
Ivan Evtimov
Nikolaos Tsilivis
Randall Balestriero
Julia Kempe
Mark Ibrahim
44
0
0
25 Nov 2025
Fara-7B: An Efficient Agentic Model for Computer Use
Ahmed Awadallah
Yash Lara
Raghav Magazine
Hussein Mozannar
Akshay Nambi
...
Corby Rosset
Alexey Taymanov
Vibhav Vineet
Spencer Whitehead
Andrew Zhao
40
0
0
24 Nov 2025
UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability
Horia Cristescu
Charles Park
Trong Canh Nguyen
Sergiu Talmacel
Alexandru-Gabriel Ilie
Stefan Adam
ELM
100
0
0
21 Nov 2025
Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
Waseem Alshikh
Muayad Ali
Brian Kennedy
Dmytro Mozolevskyi
28
0
0
11 Nov 2025
SCUBA: Salesforce Computer Use Benchmark
Yutong Dai
Krithika Ramakrishnan
Jing Gu
M. Fernández
Yanqi Luo
...
Zhenyu Hu
Silvio Savarese
Caiming Xiong
Zeyuan Chen
Ran Xu
ELM
111
1
0
30 Sep 2025
WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
Su Kara
Fazle Faisal
Suman Nath
100
0
0
28 Sep 2025
WebMall - A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]
Ralph Peeters
Aaron Steiner
Luca Schwarz
Julian Yuya Caspary
Christian Bizer
96
0
0
18 Aug 2025
NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset
Zihan Zheng
Tianle Cui
Chuwen Xie
Jiahui Zhang
Jiahui Pan
Lewei He
Qianglong Chen
LLMAG
152
0
0
02 Aug 2025
WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks
Zihao Sun
Ling Chen
LLMAG
102
0
0
01 Jul 2025
Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey
Jiachen Zhu
Menghui Zhu
Renting Rui
Rong Shan
Congmin Zheng
...
Jianghao Lin
Weiwen Liu
Ruiming Tang
Yong Yu
Weinan Zhang
LLMAG
ELM
210
6
0
06 Jun 2025
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao
Jaylen Jones
Linxi Jiang
Eric Fosler-Lussier
Yu-Chuan Su
Zhiqiang Lin
Huan Sun
ELM
311
10
0
28 May 2025
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
Christian Schroeder de Witt
AAML
AI4CE
1.0K
29
0
04 May 2025
Survey on Evaluation of LLM-based Agents
Asaf Yehudai
Lilach Eden
Alan Li
Guy Uziel
Yilun Zhao
Roy Bar-Haim
Arman Cohan
Michal Shmueli-Scheuer
LLMAG
ELM
413
62
0
20 Mar 2025
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
Yifei Zhou
Song Jiang
Yuandong Tian
Jason Weston
Sergey Levine
Sainbayar Sukhbaatar
Xian Li
LLMAG
LRM
323
45
0
19 Mar 2025
AI Agents: Evolution, Architecture, and Real-World Applications
Naveen Krishnan
LLMAG
LM&Ro
AI4TS
AI4CE
142
30
0
16 Mar 2025
Towards Enterprise-Ready Computer Using Generalist Agent
Sami Marreed
Alon Oved
Avi Yaeli
Segev Shlomov
Ido Levy
Aviad Sela
Asaf Adi
Nir Mashkif
LLMAG
251
11
0
24 Feb 2025
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Robert Z. Sparks
Charlie Snell
Kanishk Gandhi
Alon Albalak
Anikait Singh
...
Dakota Mahan
Louis Castricato
Jan-Philipp Fränken
Nick Haber
Chelsea Finn
LRM
298
78
0
08 Jan 2025
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu
Yufan Song
Boxuan Li
Yuxuan Tang
Kritanjali Jain
...
Wayne Chi
Lawrence Jang
Yiqing Xie
Shuyan Zhou
Graham Neubig
ELM
567
87
0
18 Dec 2024
Beyond Browsing: API-Based Web Agents
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yueqi Song
Frank F. Xu
Shuyan Zhou
Graham Neubig
481
44
0
21 Oct 2024
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
International Conference on Learning Representations (ICLR), 2024
Ke Yang
Yao Liu
Sapana Chaudhary
Rasool Fakoor
Pratik Chaudhari
George Karypis
Huzefa Rangwala
LLMAG
LM&Ro
438
59
0
17 Oct 2024