AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

22 July 2024
Ori Yoran
S. Amouyal
Chaitanya Malaviya
Ben Bogin
Ofir Press
Jonathan Berant
    LLMAG
ArXiv (abs) | PDF | HTML | HuggingFace (9 upvotes)

Papers citing "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"

38 / 38 papers shown
Interaction-Driven Browsing: A Human-in-the-Loop Conceptual Framework Informed by Human Web Browsing for Browser-Using Agents
Hyeonggeun Yun
Jinkyu Jang
0
0
0
15 Sep 2025
Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Alva West
Yixuan Weng
Minjun Zhu
Zhen Lin
Yue Zhang
Yue Zhang
3
0
0
12 Sep 2025
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal
Mahsa Massoud
Aarash Feizi
Zichao Li
Suyuchen Wang
...
Siva Reddy
Juan A. Rodriguez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
LRM
32
0
0
22 Aug 2025
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo
Zhiqi Shen
Wenzhuo Yang
Zirui Zhao
Prathyusha Jwalapuram
Amrita Saha
Doyen Sahoo
Silvio Savarese
Caiming Xiong
Junnan Li
ELM
56
1
0
20 Aug 2025
MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Tomer Wolfson
H. Trivedi
Mor Geva
Yoav Goldberg
Dan Roth
Tushar Khot
Ashish Sabharwal
Reut Tsarfaty
RALM LRM
107
1
0
15 Aug 2025
WebDS: An End-to-End Benchmark for Web-based Data Science
Ethan Hsu
Hong Meng Yam
Ines Bouissou
Aaron Murali John
Raj Thota
...
G K Dharesan
Alexander Spangher
Shikhar Murty
Tenghao Huang
Christopher D. Manning
34
0
0
02 Aug 2025
SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model
Mingkai Deng
Jinyu Hou
Yilin Shen
Hongxia Jin
Graham Neubig
Zhiting Hu
Eric P. Xing
LLMAG LM&Ro LRM
43
1
0
31 Jul 2025
Evaluation and Benchmarking of LLM Agents: A Survey
Mahmoud Mohammadi
Yipeng Li
Jane Lo
Wendy Yip
LLMAG ELM
60
3
0
29 Jul 2025
WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks
Zihao Sun
Ling Chen
LLMAG
70
0
0
01 Jul 2025
Deep Research Agents: A Systematic Examination And Roadmap
Y. Huang
Y. Chen
Haozheng Zhang
Kang Li
Huichi Zhou
...
Lifeng Shang
Songcen Xu
Jianye Hao
Youssef Attia El Hili
Jun Wang
LLMAG
47
14
0
22 Jun 2025
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Jonathan Kutasov
Yuqi Sun
Paul Colognese
Teun van der Weij
Linda Petrini
...
Xiang Deng
Henry Sleight
Tyler Tracy
Buck Shlegeris
Joe Benton
LLMAG
146
3
0
17 Jun 2025
Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation
Benjamin Elder
Anupama Murthi
J. Kang
Ankita Rajaram Naik
Kiran Kate
Kinjal Basu
Danish Contractor
76
0
0
12 Jun 2025
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu
Yinzhu Quan
85
2
0
09 Jun 2025
DeepShop: A Benchmark for Deep Research Shopping Agents
Yougang Lyu
Xiaoyu Zhang
Lingyong Yan
Maarten de Rijke
Zhaochun Ren
Xiuying Chen
149
6
0
03 Jun 2025
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
Haoyu Sun
Huichen Will Wang
Jiawei Gu
Linjie Li
Yu Cheng
VLM
170
1
0
23 May 2025
MAPS: A Multilingual Benchmark for Global Agent Performance and Security
Omer Hofman
Jonathan Brokman
Oren Rachmil
Shamik Bose
Vikas Pahuja
Toshiya Shimizu
Trisha Starostina
Kelly Marchisio
Seraphina Goldfarb-Tarrant
Roman Vainshtein
116
0
0
21 May 2025
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
Junyang Wang
Haiyang Xu
Xi Zhang
Ming Yan
Ji Zhang
Fei Huang
Jitao Sang
247
1
0
20 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
227
8
0
13 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
Jing Liu
Zhiguang Han
...
Beibin Li
Chi Wang
Hongru Wang
Yuxiao Chen
Qingyun Wu
304
16
0
30 Apr 2025
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Divyansh Garg
Shaun VanWeelden
Diego Caples
Andis Draguns
Nikil Ravi
...
Youngchul Joo
Jindong Gu
Charles London
Christian Schroeder de Witt
S. Motwani
262
8
0
15 Apr 2025
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Xing Han Lù
Amirhossein Kazemnejad
Nicholas Meade
Arkil Patel
Dongchan Shin
Alejandra Zambrano
Karolina Stañczak
Peter Shaw
Christopher Pal
Siva Reddy
LLMAG
162
7
0
11 Apr 2025
Inducing Programmatic Skills for Agentic Tasks
Zora Z. Wang
Apurva Gandhi
Graham Neubig
Daniel Fried
LLMAG
154
7
0
09 Apr 2025
An Illusion of Progress? Assessing the Current State of Web Agents
Tianci Xue
Weijian Qi
Tianneng Shi
Chan Hee Song
Boyu Gou
Basel Alomair
Huan Sun
Yu Su
LLMAG ELM
510
23
1
02 Apr 2025
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Deepak Nathani
Lovish Madaan
Nicholas Roberts
Nikolay Bashlykov
Ajay Menon
...
Tatiana Shavrina
Jakob Foerster
Yoram Bachrach
William Yang Wang
Roberta Raileanu
LLMAG
209
22
0
21 Feb 2025
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Vardaan Pahuja
Yadong Lu
Corby Rosset
Boyu Gou
Arindam Mitra
Spencer Whitehead
Yu Su
Ahmed Awadallah
LLMAG LM&Ro
386
12
1
17 Feb 2025
Preventing Rogue Agents Improves Multi-Agent Collaboration
Ohav Barbi
Ori Yoran
Mor Geva
168
4
0
09 Feb 2025
The AI Agent Index
Stephen Casper
Luke Bailey
Rosco Hunter
Carson Ezell
Emma Cabalé
...
Phillip J. K. Christoffersen
A. Pinar Ozisik
Rakshit Trivedi
Dylan Hadfield-Menell
Noam Kolt
225
11
0
03 Feb 2025
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Zhenhailong Wang
Haiyang Xu
Junyang Wang
Xi Zhang
Ming Yan
Junxuan Zhang
Fei Huang
Heng Ji
194
49
0
20 Jan 2025
WebWalker: Benchmarking LLMs in Web Traversal
Jialong Wu
Wenbiao Yin
Yong Jiang
Zhenglin Wang
Zekun Xi
...
Linhai Zhang
Yulan He
Deyu Zhou
Pengjun Xie
Fei Huang
206
39
0
13 Jan 2025
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
Han Zhang
Xiaoman Pan
Hongwei Wang
Kaixin Ma
Wenhao Yu
Dong Yu
LLMAG
205
5
0
03 Jan 2025
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu
Yufan Song
Boxuan Li
Yuxuan Tang
Kritanjali Jain
...
Wayne Chi
Lawrence Jang
Yiqing Xie
Shuyan Zhou
Graham Neubig
LLMAG
278
58
0
18 Dec 2024
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier De Chezelles
Maxime Gasse
Alexandre Lacoste
Alexandre Drouin
Massimo Caccia
...
Siva Reddy
Quentin Cappart
Graham Neubig
Ruslan Salakhutdinov
Nicolas Chapados
LLMAG
306
31
0
06 Dec 2024
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
Kung-Hsiang Huang
Akshara Prabhakar
Sidharth Dhawan
Yixin Mao
Huan Wang
Silvio Savarese
Caiming Xiong
Philippe Laban
Chien-Sheng Wu
174
19
0
04 Nov 2024
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Ido Levy
Ben Wiesel
Sami Marreed
Alon Oved
Avi Yaeli
Segev Shlomov
LLMAG
268
31
0
09 Oct 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM ALM LM&MA
231
55
0
09 Jun 2024
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles
Sarah Clinckemaillie
Yifan Chang
Jonathan Waltz
Gabrielle Lau
...
Daniel Toyama
Robert Berry
Divya Tyamagundlu
Timothy Lillicrap
Oriana Riva
LLMAG
280
107
0
23 May 2024
Tur[k]ingBench: A Challenge Benchmark for Web Agents
Kevin Xu
Yeganeh Kordi
Kate Sanders
Yizhong Wang
Adam Byerly
Jingyu Zhang
Benjamin Van Durme
Daniel Khashabi
LLMAG
232
12
0
18 Mar 2024
Large Language Models for Information Retrieval: A Survey
Yutao Zhu
Huaying Yuan
Shuting Wang
Jiongnan Liu
Wenhan Liu
Chenlong Deng
Haonan Chen
Zheng Liu
Zhicheng Dou
Ji-Rong Wen
KELM
252
360
0
14 Aug 2023